pith. machine review for the scientific record.

arxiv: 2605.12953 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Recognition: unknown

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords language-guided segmentation · training-free · multimodal chain-of-reasoning · Set-of-Mark prompting · MLLM · SAM · visual reasoning loop

The pith

Seg-Agent lets off-the-shelf multimodal LLMs segment images from language instructions by running an iterative visual reasoning loop over marked regions on the image itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Seg-Agent, a training-free system for language-guided segmentation that keeps the usual MLLM-plus-SAM two-stage pipeline but replaces its text-only reasoning with an explicit multimodal chain-of-reasoning loop. In each cycle the method renders candidate regions onto the image using Set-of-Mark visual prompts, lets the MLLM observe the spatial layout directly, then selects and refines the mask in successive passes. This visual feedback loop removes the need for any parameter updates or large-scale training data while matching the accuracy of supervised methods on a new benchmark spanning semantic, generic-object, and reasoning-guided tasks.
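Read operationally, the loop looks roughly like the sketch below. This is a minimal reconstruction under stated assumptions (a chat-style MLLM and a promptable, SAM-like segmenter); every function name is a hypothetical stand-in, not the paper's actual interface.

```python
# Minimal sketch of the generation -> selection -> refinement loop summarized above.
# All callables passed in (mllm_choose, propose_candidate_masks, render_som,
# refine_mask) are hypothetical stand-ins for an off-the-shelf MLLM call, a
# promptable segmenter in the spirit of SAM, Set-of-Mark overlay drawing, and a
# refinement re-prompt.

def seg_agent(image, instruction, mllm_choose, propose_candidate_masks,
              render_som, refine_mask, max_rounds=3):
    """Training-free loop: propose regions, let the MLLM pick from a marked image,
    then iteratively refine the chosen mask. No weights are updated anywhere."""
    mask = None
    for _ in range(max_rounds):
        # 1. Generation: obtain candidate regions, optionally conditioned on the
        #    mask from the previous round.
        candidates = propose_candidate_masks(image, prior_mask=mask)

        # 2. Selection: draw numbered Set-of-Mark overlays and ask the MLLM which
        #    mark satisfies the language instruction. The model reasons over the
        #    marked image itself rather than over textual coordinates.
        marked_image = render_som(image, candidates)
        chosen_id = mllm_choose(marked_image, instruction,
                                options=list(range(len(candidates))))
        mask = candidates[chosen_id]

        # 3. Refinement: re-prompt the segmenter with points/boxes derived from the
        #    chosen mask and let the MLLM verify; stop early once it accepts.
        mask, accepted = refine_mask(image, mask, instruction)
        if accepted:
            break
    return mask
```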

Core claim

Explicit Multimodal Chain-of-Reasoning, built from generation, selection, and refinement stages that operate on Set-of-Mark visual prompts, enables an unmodified MLLM to perform accurate language-guided segmentation by reasoning directly in the visual domain rather than solely in text.

What carries the argument

Iterative visual reasoning loop that renders candidate regions with Set-of-Mark prompts so the MLLM can see and iteratively refine spatial selections on the image.
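For illustration, one way the Set-of-Mark rendering step could look is sketched below; the overlay style (numbered badges at mask centroids) is an assumption on our part, and the paper's exact rendering of marks may differ.

```python
# One possible implementation of the render_som helper from the sketch above:
# overlay a numbered badge at the centroid of each candidate mask so a chat MLLM
# can answer with an index. The paper's actual overlay style may differ.
import numpy as np
from PIL import Image, ImageDraw

def render_som(image: Image.Image, masks: list) -> Image.Image:
    """Overlay numbered marks on candidate regions (masks are HxW boolean arrays)."""
    marked = image.convert("RGB")
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            continue  # skip empty candidates
        cx, cy = int(xs.mean()), int(ys.mean())
        # Small filled badge with the candidate index at the region centroid.
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12], fill=(255, 0, 0))
        draw.text((cx - 4, cy - 6), str(idx), fill=(255, 255, 255))
    return marked
```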

If this is right

  • Segmentation performance on arbitrary language instructions becomes available at inference time with no dataset collection or model updates.
  • The same visual-loop mechanism can be applied to any foundational segmentation model that accepts point or box prompts.
  • A single benchmark covering explicit semantic, generic-object, and reasoning-guided cases now exists to measure generalization without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that adding visual feedback loops can improve spatial grounding in other MLLM tasks such as referring expression comprehension or visual question answering.
  • Because the method never updates weights, it can be combined with future stronger base MLLMs or segmentation models without additional training cost.
  • The introduced benchmark offers a standardized way to compare training-free and training-based methods across diverse instruction types.

Load-bearing premise

That an off-the-shelf multimodal LLM can correctly interpret and act on visual marks overlaid on the image to select and refine regions without any fine-tuning.

What would settle it

A controlled test in which the MLLM, shown the same image with Set-of-Mark prompts, repeatedly fails to select the region matching the language instruction and produces masks whose IoU remains below the supervised baseline.
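A minimal harness for that test might look as follows; `seg_agent` and `baseline_iou` are hypothetical placeholders for the training-free loop and a supervised reference score, not interfaces or numbers taken from the paper.

```python
# Sketch of the falsification test described above: run the loop on
# (image, instruction, ground-truth mask) triples and check whether mean IoU
# stays at or above a supervised reference.
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks with identical shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def premise_holds(samples, seg_agent, baseline_iou):
    """The load-bearing premise survives if the training-free loop matches the baseline."""
    scores = [iou(seg_agent(img, instr), gt) for img, instr, gt in samples]
    return float(np.mean(scores)) >= baseline_iou
```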

Figures

Figures reproduced from arXiv: 2605.12953 by Chao Hao, Guangcong Wang, Ji Du, Jun Xu, Shuo Ye, Xiaodong Cun, Xubin Zheng, Zitong Yu, Ziyue Qiao.

Figure 1
Figure 1. Comparison of different reasoning paradigms for language-guided segmentation. (a) Directly Output: Training-based methods where MLLM directly generates visual prompt without reasoning. (b) Text-only Chain-of-Reasoning: Training-based methods where MLLM performs reasoning in the textual domain before outputting coordinates. (c) Our Seg-Agent: We construct an explicit multimodal reasoning chain (generation… view at source ↗
Figure 2
Figure 2. Given an image and a textual target, Seg-Agent can handle segmentation tasks across various scenarios: (a) Explicit Semantic Segmentation: segmenting objects with clearly defined semantics (e.g., “cat”, “dog”). (b) Generic Object Segmentation: segmenting conceptually defined objects without specific categories (e.g., “camouflaged object”, “salient object”). (c) Reasoning-Guided Segmentation: segmenting tar… view at source ↗
Figure 3
Figure 3. Illustration of Seg-Agent. By constructing an explicit multimodal reasoning chain: generation, selection, and refinement, the MLLM is able to improve the quality of generated visual prompts, thereby enabling SAM to produce more accurate target segmentation masks. SoM here indicates Set-of-Mark prompt [21]. view at source ↗
Figure 4
Figure 4. Visual comparison between Seg-Agent and related methods. We show three common scenarios of language-guided segmentation. view at source ↗
Figure 5
Figure 5. More visualization results of Seg-Agent. Seg-Agent can handle language inputs in various forms, including both Chinese and English. It is also capable of processing different types of images, such as real-world photos, captured photographs, cartoon images, and AI-generated images. These images demonstrate Seg-Agent’s strong generalization ability and its broad range of application scenarios. Please zoom in… view at source ↗
read the original abstract

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Seg-Agent, a training-free framework for language-guided segmentation that implements explicit multimodal chain-of-reasoning via a three-stage interactive loop (generation, selection, refinement) on off-the-shelf MLLMs. It uses Set-of-Mark (SoM) visual prompting to render candidate regions directly on the image, enabling the MLLM to perform spatial reasoning in the visual domain rather than text-only. The central claim is that this approach achieves performance comparable to state-of-the-art training-based methods without any parameter updates. The paper also presents Various-LangSeg, a new benchmark spanning explicit semantic, generic object, and reasoning-guided segmentation tasks, with experiments demonstrating effectiveness and robustness.

Significance. If the quantitative results and ablations hold, the work would be significant for demonstrating that unmodified MLLMs can achieve competitive language-guided segmentation through test-time visual feedback loops, reducing reliance on large-scale supervised training. The Various-LangSeg benchmark could become a useful standard for evaluating generalization across diverse language-guided scenarios.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim of 'performance comparable to state-of-the-art training-based methods' is asserted without any quantitative numbers, tables, or error analysis in the provided text; this is load-bearing because the training-free comparability rests entirely on unshown results.
  2. [§3 (Method)] The three-stage visual reasoning loop is presented as correcting the acknowledged limited spatial grounding of off-the-shelf MLLMs via SoM, yet no ablation isolates whether the selection and refinement steps reduce grounding errors or propagate them on reasoning-guided cases; this directly undermines the weakest assumption, namely that unmodified MLLMs can reliably execute iterative spatial selection without adaptation.
minor comments (2)
  1. [§4.1] The description of Various-LangSeg would benefit from explicit dataset statistics, construction protocol, and split details to allow reproducibility.
  2. [§3] Notation for the generation-selection-refinement stages could be formalized with pseudocode or equations to clarify the exact interaction with the MLLM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim of 'performance comparable to state-of-the-art training-based methods' is asserted without any quantitative numbers, tables, or error analysis in the provided text; this is load-bearing because the training-free comparability rests entirely on unshown results.

    Authors: The full manuscript includes Section 4 with quantitative tables, comparisons to SOTA training-based methods, and error analyses that support the abstract claim. We will revise to add explicit cross-references from the abstract and §3 to the specific tables and figures in §4. revision: partial

  2. Referee: [§3 (Method)] The three-stage visual reasoning loop is presented as correcting the acknowledged limited spatial grounding of off-the-shelf MLLMs via SoM, yet no ablation isolates whether the selection and refinement steps reduce grounding errors or propagate them on reasoning-guided cases; this directly undermines the weakest assumption, namely that unmodified MLLMs can reliably execute iterative spatial selection without adaptation.

    Authors: We agree that a targeted ablation isolating grounding error reduction across stages (especially on reasoning-guided cases) would strengthen the paper. We will add this analysis in the revised version, measuring error rates before/after selection and refinement on the relevant subset of Various-LangSeg. revision: yes
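A sketch of how such a stage-wise measurement could be instrumented, offered purely as an editorial illustration; `run_stages` is a hypothetical hook exposing the intermediate mask after each stage, not part of the described method.

```python
# Editorial sketch of the promised stage-wise ablation: record IoU after the
# generation, selection, and refinement stages so error reduction (or propagation)
# across stages becomes visible on reasoning-guided samples.
import numpy as np

def stage_ablation(samples, run_stages):
    """Mean IoU per stage over a set of (image, instruction, ground-truth mask) samples."""
    per_stage = {}
    for image, instruction, gt in samples:
        for stage, mask in run_stages(image, instruction).items():
            inter = np.logical_and(mask, gt).sum()
            union = np.logical_or(mask, gt).sum()
            per_stage.setdefault(stage, []).append(inter / union if union else 1.0)
    return {stage: float(np.mean(vals)) for stage, vals in per_stage.items()}
```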

Circularity Check

0 steps flagged

No circularity: procedural framework with no derivation chain or fitted parameters

full rationale

The paper describes Seg-Agent as a training-free, procedural framework that applies off-the-shelf MLLMs to SoM-rendered images in a three-stage generation-selection-refinement loop. No equations, parameters, or mathematical derivations are present that could reduce to self-definition, fitted inputs called predictions, or self-citation load-bearing steps. The central claim of comparable performance rests on external benchmark results (Various-LangSeg) rather than any internal reduction to the method's own inputs. This is the expected non-finding for a methods paper whose reasoning chain is entirely empirical and non-derivational.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that current MLLMs possess sufficient visual grounding once candidate regions are explicitly marked on the image.

axioms (1)
  • domain assumption: Off-the-shelf MLLMs can perform reliable spatial selection and iterative refinement when shown Set-of-Mark visual prompts.
    This is the key premise that allows the method to remain training-free.

pith-pipeline@v0.9.0 · 5602 in / 1110 out tokens · 35395 ms · 2026-05-14T19:44:19.960359+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 46 canonical work pages · 8 internal anchors

  1. [1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  2. [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  3. [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  4. [4] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation, 2021.
  5. [5] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation, 2022.
  6. [6] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  7. [7] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  8. [8] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
  9. [9] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
  10. [10] XuDong Wang, Shaolun Zhang, Shufan Li, Kehan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation with large language models. In The Thirteenth International Conference on Learning Representations, 2025.
  11. [11] Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4Seg: Reimagining image segmentation as text generation, 2024.
  12. [12] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Neural Information Processing Systems (NeurIPS), 2021.
  13. [13] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
  14. [14] Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation, 2023.
  15. [15] Chao Hao, Zitong Yu, Xin Liu, Jun Xu, Huanjing Yue, and Jingyu Yang. A simple yet effective network based on vision transformer for camouflaged object and salient object detection. IEEE Transactions on Image Processing, 2025.
  16. [16] Yiyu Liu, Shuo Ye, Chao Hao, and Zitong Yu. Yuv20k: A complexity-driven benchmark and trajectory-aware alignment model for video camouflaged object detection, 2026.
  17. [17] Chao Hao, Zitong Yu, Xin Liu, Yuhao Wang, Weicheng Xie, Jingang Shi, Huanjing Yue, and Jingyu Yang. Distribution-specific learning for joint salient and camouflaged object detection, 2025.
  18. [18] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024.
  19. [19] Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multimodal large language model for referring expression segmentation. In European Conference on Computer Vision, pages 323–340. Springer, 2024.
  20. [20] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.
  21. [21] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
  22. [22] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, October 2014. Association for Computational Linguistics.
  23. [23] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11–20. IEEE, June 2016.
  24. [24] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2024.
  25. [25] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents, 2024.
  26. [26] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual question answering, 2016.
  27. [27] Taraneh Ghandi, Hamidreza Pourreza, and Hamidreza Mahyar. Deep learning approaches on image captioning: A review. ACM Computing Surveys, 56(3):1–39, October 2023.
  28. [28] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024.
  29. [29] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, and Jifeng Dai. VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks, 2024.
  30. [30] Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in Neural Information Processing Systems, 37:71737–71767, 2024.
  31. [31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  32. [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  33. [33] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  34. [34] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H. S. Torr. LAVT: Language-aware vision transformer for referring image segmentation, 2022.
  35. [35] Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025.
  36. [36] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, June 2024.
  37. [37] Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. In The Thirteenth International Conference on Learning Representations, 2025.
  38. [38] Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11591–11601, 2021.
  39. [39] Yin Li, Xiaodi Hou, Christof Koch, James M. Rehg, and Alan L. Yuille. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 280–287, 2014.
  40. [40] Tomás F. Yago Vicente, Le Hou, Chen-Ping Yu, Minh Hoai, and Dimitris Samaras. Large-scale training of shadow detectors with noisily-annotated shadow examples. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VI, pages 816–832. Springer, 2016.
  41. [41] Jing Dong, Wei Wang, and Tieniu Tan. CASIA image tampering detection evaluation database. In 2013 IEEE China Summit and International Conference on Signal and Information Processing, pages 422–426, 2013.
  42. [42] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2015.
  43. [43] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706–5722, 2015.
  44. [44] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2777–2787, 2020.
  45. [45] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  46. [46] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
  47. [47] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image, and language, 2022.
  48. [48] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once, 2023.
  49. [49] Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. PerceptionGPT: Effectively fusing visual perception into LLM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27124–27133, June 2024.
  50. [50] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
  51. [51] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025.
  52. [52] Shiu-hong Kao, Yu-Wing Tai, and Chi-Keung Tang. Think before you segment: High-quality reasoning segmentation with GPT chain of thoughts. arXiv preprint arXiv:2503.07503, 2025.
  53. [53] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.