pith. machine review for the scientific record.

arxiv: 2605.12953 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Recognition: unknown

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords language-guided segmentation · training-free · multimodal chain-of-reasoning · Set-of-Mark prompting · MLLM · SAM · visual reasoning loop

The pith

Seg-Agent lets off-the-shelf multimodal LLMs segment images from language instructions by running an iterative visual reasoning loop over marked regions on the image itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Seg-Agent, a training-free system for language-guided segmentation that keeps the usual MLLM-plus-SAM two-stage pipeline but replaces its text-only reasoning with an explicit multimodal chain-of-reasoning loop. In each cycle the method renders candidate regions onto the image using Set-of-Mark visual prompts, lets the MLLM observe the spatial layout directly, then selects and refines the mask in successive passes. This visual feedback loop removes the need for any parameter updates or large-scale training data while matching the accuracy of supervised methods on a new benchmark spanning semantic, generic-object, and reasoning-guided tasks.
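Read operationally, the loop looks roughly like the sketch below. This is a minimal reconstruction under stated assumptions (a chat-style MLLM and a promptable, SAM-like segmenter); every function name is a hypothetical stand-in, not the paper's actual interface.

```python
# Minimal sketch of the generation -> selection -> refinement loop summarized above.
# All callables passed in (mllm_choose, propose_candidate_masks, render_som,
# refine_mask) are hypothetical stand-ins for an off-the-shelf MLLM call, a
# promptable segmenter in the spirit of SAM, Set-of-Mark overlay drawing, and a
# refinement re-prompt.

def seg_agent(image, instruction, mllm_choose, propose_candidate_masks,
              render_som, refine_mask, max_rounds=3):
    """Training-free loop: propose regions, let the MLLM pick from a marked image,
    then iteratively refine the chosen mask. No weights are updated anywhere."""
    mask = None
    for _ in range(max_rounds):
        # 1. Generation: obtain candidate regions, optionally conditioned on the
        #    mask from the previous round.
        candidates = propose_candidate_masks(image, prior_mask=mask)

        # 2. Selection: draw numbered Set-of-Mark overlays and ask the MLLM which
        #    mark satisfies the language instruction. The model reasons over the
        #    marked image itself rather than over textual coordinates.
        marked_image = render_som(image, candidates)
        chosen_id = mllm_choose(marked_image, instruction,
                                options=list(range(len(candidates))))
        mask = candidates[chosen_id]

        # 3. Refinement: re-prompt the segmenter with points/boxes derived from the
        #    chosen mask and let the MLLM verify; stop early once it accepts.
        mask, accepted = refine_mask(image, mask, instruction)
        if accepted:
            break
    return mask
```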

Core claim

Explicit Multimodal Chain-of-Reasoning, built from generation, selection, and refinement stages that operate on Set-of-Mark visual prompts, enables an unmodified MLLM to perform accurate language-guided segmentation by reasoning directly in the visual domain rather than solely in text.

What carries the argument

Iterative visual reasoning loop that renders candidate regions with Set-of-Mark prompts so the MLLM can see and iteratively refine spatial selections on the image.
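For illustration, one way the Set-of-Mark rendering step could look is sketched below; the overlay style (numbered badges at mask centroids) is an assumption on our part, and the paper's exact rendering of marks may differ.

```python
# One possible implementation of the render_som helper from the sketch above:
# overlay a numbered badge at the centroid of each candidate mask so a chat MLLM
# can answer with an index. The paper's actual overlay style may differ.
import numpy as np
from PIL import Image, ImageDraw

def render_som(image: Image.Image, masks: list) -> Image.Image:
    """Overlay numbered marks on candidate regions (masks are HxW boolean arrays)."""
    marked = image.convert("RGB")
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            continue  # skip empty candidates
        cx, cy = int(xs.mean()), int(ys.mean())
        # Small filled badge with the candidate index at the region centroid.
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12], fill=(255, 0, 0))
        draw.text((cx - 4, cy - 6), str(idx), fill=(255, 255, 255))
    return marked
```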

If this is right

  • Segmentation performance on arbitrary language instructions becomes available at inference time with no dataset collection or model updates.
  • The same visual-loop mechanism can be applied to any foundational segmentation model that accepts point or box prompts.
  • A single benchmark covering explicit semantic, generic-object, and reasoning-guided cases now exists to measure generalization without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that adding visual feedback loops can improve spatial grounding in other MLLM tasks such as referring expression comprehension or visual question answering.
  • Because the method never updates weights, it can be combined with future stronger base MLLMs or segmentation models without additional training cost.
  • The introduced benchmark offers a standardized way to compare training-free and training-based methods across diverse instruction types.

Load-bearing premise

That an off-the-shelf multimodal LLM can correctly interpret and act on visual marks overlaid on the image to select and refine regions without any fine-tuning.

What would settle it

A controlled test in which the MLLM, shown the same image with Set-of-Mark prompts, repeatedly fails to select the region matching the language instruction and produces masks whose IoU remains below the supervised baseline.
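A minimal harness for that test might look as follows; `seg_agent` and `baseline_iou` are hypothetical placeholders for the training-free loop and a supervised reference score, not interfaces or numbers taken from the paper.

```python
# Sketch of the falsification test described above: run the loop on
# (image, instruction, ground-truth mask) triples and check whether mean IoU
# stays at or above a supervised reference.
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks with identical shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def premise_holds(samples, seg_agent, baseline_iou):
    """The load-bearing premise survives if the training-free loop matches the baseline."""
    scores = [iou(seg_agent(img, instr), gt) for img, instr, gt in samples]
    return float(np.mean(scores)) >= baseline_iou
```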

Figures

Figures reproduced from arXiv: 2605.12953 by Chao Hao, Guangcong Wang, Ji Du, Jun Xu, Shuo Ye, Xiaodong Cun, Xubin Zheng, Zitong Yu, Ziyue Qiao.

Figure 1
Figure 1. Comparison of different reasoning paradigms for language-guided segmentation. (a) Directly Output: Training-based methods where MLLM directly generates visual prompt without reasoning. (b) Text-only Chain-of-Reasoning: Training-based methods where MLLM performs reasoning in the textual domain before outputting coordinates. (c) Our Seg-Agent: We construct an explicit multimodal reasoning chain (generation… view at source ↗
Figure 2
Figure 2. Given an image and a textual target, Seg-Agent can handle segmentation tasks across various scenarios: (a) Explicit Semantic Segmentation: segmenting objects with clearly defined semantics (e.g., “cat”, “dog”). (b) Generic Object Segmentation: segmenting conceptually defined objects without specific categories (e.g., “camouflaged object”, “salient object”). (c) Reasoning-Guided Segmentation: segmenting tar… view at source ↗
Figure 3
Figure 3. Illustration of Seg-Agent. By constructing an explicit multimodal reasoning chain: generation, selection, and refinement, the MLLM is able to improve the quality of generated visual prompts, thereby enabling SAM to produce more accurate target segmentation masks. SoM here indicates Set-of-Mark prompt [21]. view at source ↗
Figure 4
Figure 4. Visual comparison between Seg-Agent and related methods. We show three common scenarios of language-guided segmentation. view at source ↗
Figure 5
Figure 5. More visualization results of Seg-Agent. Seg-Agent can handle language inputs in various forms, including both Chinese and English. It is also capable of processing different types of images, such as real-world photos, captured photographs, cartoon images, and AI-generated images. These images demonstrate Seg-Agent’s strong generalization ability and its broad range of application scenarios. Please zoom in… view at source ↗
read the original abstract

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Seg-Agent, a training-free framework for language-guided segmentation that implements explicit multimodal chain-of-reasoning via a three-stage interactive loop (generation, selection, refinement) on off-the-shelf MLLMs. It uses Set-of-Mark (SoM) visual prompting to render candidate regions directly on the image, enabling the MLLM to perform spatial reasoning in the visual domain rather than text-only. The central claim is that this approach achieves performance comparable to state-of-the-art training-based methods without any parameter updates. The paper also presents Various-LangSeg, a new benchmark spanning explicit semantic, generic object, and reasoning-guided segmentation tasks, with experiments demonstrating effectiveness and robustness.

Significance. If the quantitative results and ablations hold, the work would be significant for demonstrating that unmodified MLLMs can achieve competitive language-guided segmentation through test-time visual feedback loops, reducing reliance on large-scale supervised training. The Various-LangSeg benchmark could become a useful standard for evaluating generalization across diverse language-guided scenarios.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim of 'performance comparable to state-of-the-art training-based methods' is asserted without any quantitative numbers, tables, or error analysis in the provided text; this is load-bearing because the training-free comparability rests entirely on unshown results.
  2. [§3 (Method)] The three-stage visual reasoning loop is presented as correcting the acknowledged limited spatial grounding of off-the-shelf MLLMs via SoM, yet no ablation isolates whether the selection and refinement steps reduce grounding errors or propagate them on reasoning-guided cases; this directly undermines the weakest assumption, namely that unmodified MLLMs can reliably execute iterative spatial selection without adaptation.
minor comments (2)
  1. [§4.1] The description of Various-LangSeg would benefit from explicit dataset statistics, construction protocol, and split details to allow reproducibility.
  2. [§3] Notation for the generation-selection-refinement stages could be formalized with pseudocode or equations to clarify the exact interaction with the MLLM.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim of 'performance comparable to state-of-the-art training-based methods' is asserted without any quantitative numbers, tables, or error analysis in the provided text; this is load-bearing because the training-free comparability rests entirely on unshown results.

    Authors: The full manuscript includes Section 4 with quantitative tables, comparisons to SOTA training-based methods, and error analyses that support the abstract claim. We will revise to add explicit cross-references from the abstract and §3 to the specific tables and figures in §4. revision: partial

  2. Referee: [§3 (Method)] The three-stage visual reasoning loop is presented as correcting the acknowledged limited spatial grounding of off-the-shelf MLLMs via SoM, yet no ablation isolates whether the selection and refinement steps reduce grounding errors or propagate them on reasoning-guided cases; this directly undermines the weakest assumption, namely that unmodified MLLMs can reliably execute iterative spatial selection without adaptation.

    Authors: We agree that a targeted ablation isolating grounding error reduction across stages (especially on reasoning-guided cases) would strengthen the paper. We will add this analysis in the revised version, measuring error rates before/after selection and refinement on the relevant subset of Various-LangSeg. revision: yes
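A sketch of how such a stage-wise measurement could be instrumented, offered purely as an editorial illustration; `run_stages` is a hypothetical hook exposing the intermediate mask after each stage, not part of the described method.

```python
# Editorial sketch of the promised stage-wise ablation: record IoU after the
# generation, selection, and refinement stages so error reduction (or propagation)
# across stages becomes visible on reasoning-guided samples.
import numpy as np

def stage_ablation(samples, run_stages):
    """Mean IoU per stage over a set of (image, instruction, ground-truth mask) samples."""
    per_stage = {}
    for image, instruction, gt in samples:
        for stage, mask in run_stages(image, instruction).items():
            inter = np.logical_and(mask, gt).sum()
            union = np.logical_or(mask, gt).sum()
            per_stage.setdefault(stage, []).append(inter / union if union else 1.0)
    return {stage: float(np.mean(vals)) for stage, vals in per_stage.items()}
```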

Circularity Check

0 steps flagged

No circularity: procedural framework with no derivation chain or fitted parameters

full rationale

The paper describes Seg-Agent as a training-free, procedural framework that applies off-the-shelf MLLMs to SoM-rendered images in a three-stage generation-selection-refinement loop. No equations, parameters, or mathematical derivations are present that could reduce to self-definition, fitted inputs called predictions, or self-citation load-bearing steps. The central claim of comparable performance rests on external benchmark results (Various-LangSeg) rather than any internal reduction to the method's own inputs. This is the expected non-finding for a methods paper whose reasoning chain is entirely empirical and non-derivational.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that current MLLMs possess sufficient visual grounding once candidate regions are explicitly marked on the image.

axioms (1)
  • domain assumption: Off-the-shelf MLLMs can perform reliable spatial selection and iterative refinement when shown Set-of-Mark visual prompts.
    This is the key premise that allows the method to remain training-free.

pith-pipeline@v0.9.0 · 5602 in / 1110 out tokens · 35395 ms · 2026-05-14T19:44:19.960359+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 46 canonical work pages · 8 internal anchors

  1. [1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  2. [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  3. [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  4. [4] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation, 2021.
  5. [5] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation, 2022.
  6. [6] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  7. [7] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  8. [8] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
  9. [9] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
  10. [10] XuDong Wang, Shaolun Zhang, Shufan Li, Kehan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation with large language models. In The Thirteenth International Conference on Learning Representations, 2025.
  11. [11] Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4Seg: Reimagining image segmentation as text generation, 2024.
  12. [12] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Neural Information Processing Systems (NeurIPS), 2021.
  13. [13] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
  14. [14] Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation, 2023.
  15. [15] Chao Hao, Zitong Yu, Xin Liu, Jun Xu, Huanjing Yue, and Jingyu Yang. A simple yet effective network based on vision transformer for camouflaged object and salient object detection. IEEE Transactions on Image Processing, 2025.
  16. [16] Yiyu Liu, Shuo Ye, Chao Hao, and Zitong Yu. Yuv20k: A complexity-driven benchmark and trajectory-aware alignment model for video camouflaged object detection, 2026.
  17. [17] Chao Hao, Zitong Yu, Xin Liu, Yuhao Wang, Weicheng Xie, Jingang Shi, Huanjing Yue, and Jingyu Yang. Distribution-specific learning for joint salient and camouflaged object detection, 2025.
  18. [18] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024.
  19. [19] Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multimodal large language model for referring expression segmentation. In European Conference on Computer Vision, pages 323–340. Springer, 2024.
  20. [20] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.
  21. [21] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
  22. [22] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, October 2014. Association for Computational Linguistics.
  23. [23] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11–20. IEEE, June 2016.
  24. [24] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2024.
  25. [25] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents, 2024.
  26. [26] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual question answering, 2016.
  27. [27] Taraneh Ghandi, Hamidreza Pourreza, and Hamidreza Mahyar. Deep learning approaches on image captioning: A review. ACM Computing Surveys, 56(3):1–39, October 2023.
  28. [28] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024.
  29. [29] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, and Jifeng Dai. VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks, 2024.
  30. [30] Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in Neural Information Processing Systems, 37:71737–71767, 2024.
  31. [31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  32. [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  33. [33] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  34. [34] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H. S. Torr. LAVT: Language-aware vision transformer for referring image segmentation, 2022.
  35. [35] Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025.
  36. [36] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, June 2024.
  37. [37] Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. In The Thirteenth International Conference on Learning Representations, 2025.
  38. [38] Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11591–11601, 2021.
  39. [39] Yin Li, Xiaodi Hou, Christof Koch, James M. Rehg, and Alan L. Yuille. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 280–287, 2014.
  40. [40] Tomás F. Yago Vicente, Le Hou, Chen-Ping Yu, Minh Hoai, and Dimitris Samaras. Large-scale training of shadow detectors with noisily-annotated shadow examples. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VI, pages 816–832. Springer, 2016.
  41. [41] Jing Dong, Wei Wang, and Tieniu Tan. CASIA image tampering detection evaluation database. In 2013 IEEE China Summit and International Conference on Signal and Information Processing, pages 422–426, 2013.
  42. [42] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2015.
  43. [43] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706–5722, 2015.
  44. [44] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2777–2787, 2020.
  45. [45] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  46. [46] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
  47. [47] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image, and language, 2022.
  48. [48] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once, 2023.
  49. [49] Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. PerceptionGPT: Effectively fusing visual perception into LLM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27124–27133, June 2024.
  50. [50] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
  51. [51] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025.
  52. [52] Shiu-hong Kao, Yu-Wing Tai, and Chi-Keung Tang. Think before you segment: High-quality reasoning segmentation with GPT chain of thoughts. arXiv preprint arXiv:2503.07503, 2025.
  53. [53] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.