pith. machine review for the scientific record.

arxiv: 2604.07916 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 theorem links

Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation


Pith reviewed 2026-05-10 18:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords referring expression segmentation · SAM3 · training-free · prompt generation · mask refinement · DINOv3 · open-world segmentation · expression parsing

The pith

Tarot-SAM3 turns any natural-language referring expression into reliable image masks by feeding structured prompts to SAM3 and then refining the results with DINOv3 feature comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tarot-SAM3, a training-free framework that lets the Segment Anything Model 3 handle referring expression segmentation for arbitrary text descriptions. Existing approaches need large annotated datasets and often fail on implicit or complex expressions, while simply pairing SAM3 with a multimodal language model leaves outputs unrefined. Tarot-SAM3 first uses an Expression Reasoning Interpreter to parse and rephrase expressions into heterogeneous prompts that produce initial SAM3 masks, then applies Mask Self-Refining to pick the best mask and correct errors by comparing discriminative regions with DINOv3 features. A sympathetic reader would care because this removes the need for task-specific training data and extends usable segmentation to open-world cases.

Core claim

Tarot-SAM3 is a training-free framework with two phases. The Expression Reasoning Interpreter (ERI) phase applies reasoning-assisted prompt options for structured parsing and evaluation-aware rephrasing, converting arbitrary queries into robust heterogeneous prompts for SAM3 mask generation. The Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement, leveraging DINOv3 feature relationships to compare regions and infer target affiliation, thereby correcting over- and under-segmentation.
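
Read as control flow, the claim amounts to a small two-stage loop. The sketch below is only an illustration of that structure, not the authors' code: every interface in it (mllm_parse, mllm_rephrase, sam3_segment, dinov3_feat) is a hypothetical stand-in, since the review summarizes the method without implementation details.

```python
# Minimal sketch of the two-phase Tarot-SAM3 control flow described above.
# Model calls are passed in as hypothetical callables; treat this as a
# reading aid for the claim's structure, not the paper's implementation.
import numpy as np

def expression_reasoning_interpreter(expression, mllm_parse, mllm_rephrase):
    """ERI phase: turn one arbitrary expression into heterogeneous prompts."""
    return [
        ("text", expression),                      # raw query as-is
        ("noun_phrase", mllm_parse(expression)),   # structured parse (assumed API)
        ("text", mllm_rephrase(expression)),       # evaluation-aware rephrase (assumed API)
    ]

def mask_self_refining(image, masks, dinov3_feat):
    """MSR phase: select the most consensual candidate via DINOv3 region
    features; a stand-in for the paper's affiliation-based correction."""
    feats = np.stack([dinov3_feat(image, m) for m in masks])  # (N, D) region embeddings
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T                                    # pairwise cosine similarity
    best = int(np.argmax(sims.sum(axis=1)))                   # mask most similar to the rest
    return masks[best]  # the real MSR would also grow/shrink this mask by affiliation

def tarot_sam3(image, expression, mllm_parse, mllm_rephrase, sam3_segment, dinov3_feat):
    prompts = expression_reasoning_interpreter(expression, mllm_parse, mllm_rephrase)
    masks = [sam3_segment(image, kind, value) for kind, value in prompts]
    return mask_self_refining(image, masks, dinov3_feat)
```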

What carries the argument

The two-phase Tarot-SAM3 process in which the Expression Reasoning Interpreter converts expressions into heterogeneous SAM3 prompts and the Mask Self-Refining stage uses DINOv3 feature comparisons to select and correct masks.

Load-bearing premise

The Expression Reasoning Interpreter can reliably turn arbitrary expressions into prompts that produce usable SAM3 initial masks, and DINOv3 feature comparisons can correctly identify region affiliation without introducing new errors.
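
To see what the second half of this premise commits to, here is one plausible shape for the affiliation test — a hedged sketch rather than the paper's procedure. The feature pooling, the cosine test, and the threshold tau are all assumptions introduced for illustration; the review does not report how the paper makes this decision.

```python
# Hedged sketch of the load-bearing DINOv3 premise: decide whether a candidate
# region belongs to the target by cosine similarity of pooled patch features.
# `patch_features` is an (H, W, D) DINOv3 feature map; masks are boolean (H, W).
# The threshold tau is invented here for illustration only.
import numpy as np

def pooled_feature(patch_features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the feature vectors of all patches inside the mask, L2-normalized."""
    vec = patch_features[mask].mean(axis=0)
    return vec / np.linalg.norm(vec)

def is_affiliated(patch_features, target_mask, candidate_region, tau=0.6):
    """True if the candidate region's pooled feature is close to the target's."""
    t = pooled_feature(patch_features, target_mask)
    c = pooled_feature(patch_features, candidate_region)
    return float(t @ c) >= tau

def correct_mask(patch_features, mask, candidate_regions, tau=0.6):
    """Grow the mask with affiliated regions (fixing under-segmentation) and
    drop unaffiliated overlapping parts (fixing over-segmentation)."""
    out = mask.copy()
    for region in candidate_regions:
        if is_affiliated(patch_features, mask, region, tau):
            out |= region          # under-segmentation: add the missing region
        elif (region & mask).any():
            out &= ~region         # over-segmentation: cut the spurious overlap
    return out
```

The premise fails exactly where this sketch would: when visually similar patches (high cosine similarity) are not the linguistically intended target, the correction step adds errors instead of removing them.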

What would settle it

A benchmark set of complex implicit expressions where the ERI-generated masks remain poor after refinement and overall accuracy falls below that of trained RES models.
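
Operationally, settling it reduces to running both families of methods over such a benchmark and comparing mean IoU. A minimal scorer, assuming binary masks and a predict(image, expression) callable, could look like the following; the benchmark itself (a curated set of complex implicit expressions) is the part that does not yet exist.

```python
# Minimal mIoU scorer for the decisive benchmark described above.
# `predict` maps (image, expression) -> boolean mask; ground-truth masks are
# boolean arrays of the same shape. Purely illustrative.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # empty-vs-empty counts as perfect

def mean_iou(predict, samples) -> float:
    """samples: iterable of (image, expression, gt_mask) triples."""
    return float(np.mean([iou(predict(img, expr), gt) for img, expr, gt in samples]))

# The core claim would be refuted if, on such a hard implicit set:
#   mean_iou(tarot_sam3_predict, hard_set) < mean_iou(trained_res_predict, hard_set)
```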

Figures

Figures reproduced from arXiv: 2604.07916 by Dingwen Xiao, Guangyu Xiang, Lei Chen, Lin Wang, Minwei Zhao, Shiqi Wen, Songyue Guo, Weiming Zhang.

Figure 1. Tarot-SAM3 achieves reliable referring segmentation for both explicit (top row) and implicit (middle row) reasoning queries. The bottom row highlights that Tarot-SAM3 breaks the strong reliance of prior methods [2,56] on direct MLLM parsing (left) and naive SAM3 text-mask predictions (right).
Figure 2. Overview of the Tarot-SAM3 framework.
Figure 3. Example visualizations on explicit RES benchmarks (RES, ReasonSeg Short, ReasonSeg Long), comparing SAM3-Agent, Evol-SAM, Tarot-SAM3, and ground truth on queries such as "the part that can transmit traction and braking torque, coming into contact with the ground" and "unopened flower bud".
Figure 5. Example visualizations of the sequential application of the two phases (ERI, then MSR).
Figure 6. Visual comparisons of an augmented text prompt. The initial query "the person who is speaking currently" is rephrased via Eq. (5) into "The woman holding the microphone"; MLLM bounding boxes and SAM3 masks are shown for both queries against the ground truth.
Figure 8. Open-world visual comparisons: (A) ambiguous spatial understanding, query "far left crate"; (B) misunderstanding about target level, query "the region exhibiting unusual color". Tarot-SAM3 is shown against ground truth.
Original abstract

Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Tarot-SAM3, a training-free framework for referring expression segmentation (RES) applicable to any expression. It comprises two phases: the Expression Reasoning Interpreter (ERI), which parses arbitrary natural-language expressions via reasoning-assisted prompts and evaluation-aware rephrasing to produce heterogeneous inputs for SAM3 mask generation; and the Mask Self-Refining (MSR) phase, which selects the best mask across prompt variants and applies DINOv3 feature comparisons to infer target region affiliation, thereby correcting over- and under-segmentation. The central claim is that this pipeline achieves strong performance on explicit and implicit RES benchmarks as well as open-world scenarios without any training or annotated data.

Significance. If the empirical claims hold, the work is significant as a training-free, generalizable alternative to dataset-dependent RES methods and naive SAM3+MLLM couplings. It explicitly leverages pre-trained models (SAM3 for promptable segmentation and DINOv3 for feature-based refinement) in a procedural pipeline, avoiding the need for large annotated datasets or fine-tuning. This addresses a clear limitation in current RES literature regarding generalization to implicit and open-world expressions.

major comments (3)
  1. [Abstract] The assertion that Tarot-SAM3 'achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios' is not accompanied by any quantitative metrics, baseline comparisons, or ablation numbers. This is load-bearing for the central claim, as the effectiveness of ERI prompt generation and MSR correction cannot be assessed without these results.
  2. [Mask Self-Refining (MSR) phase] The claim that DINOv3 feature comparisons reliably infer region affiliation and correct SAM3 over-/under-segmentation for implicit expressions lacks supporting error analysis, failure cases, or quantitative ablation isolating MSR's correction rate. This assumption is load-bearing because visual similarity in DINOv3 embeddings may not align with linguistic intent in complex implicit cases, potentially introducing new mismatches rather than fixing them.
  3. [Expression Reasoning Interpreter (ERI) phase] The description of how 'reasoning-assisted prompt options' and 'evaluation-aware rephrasing' transform arbitrary queries into robust heterogeneous prompts for SAM3 provides no implementation details, pseudocode, or examples of the reasoning process. This is necessary to evaluate the reliability of the initial masks on which MSR depends.
minor comments (2)
  1. [Abstract] The acronym 'SAM3' is used before its expansion as 'Segment Anything Model 3' in the abstract; define all acronyms at first use.
  2. [Abstract] The abstract contains several long compound sentences that reduce readability; consider splitting for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have carefully addressed each major comment below, providing clarifications and committing to specific revisions that will strengthen the paper's clarity, reproducibility, and empirical support without misrepresenting our contributions.

Point-by-point responses
  1. Referee: [Abstract] The assertion that Tarot-SAM3 'achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios' is not accompanied by any quantitative metrics, baseline comparisons, or ablation numbers. This is load-bearing for the central claim, as the effectiveness of ERI prompt generation and MSR correction cannot be assessed without these results.

    Authors: We agree that the abstract should include concrete quantitative support for the performance claims. Although the full manuscript contains extensive experimental results with metrics, baselines, and ablations, these are not summarized in the abstract. In the revised version, we will update the abstract to report key mIoU and other metrics on explicit/implicit benchmarks and open-world scenarios, along with brief baseline comparisons and ablation highlights. This will make the central claims more self-contained and evidence-based. revision: yes

  2. Referee: [Mask Self-Refining (MSR) phase] The claim that DINOv3 feature comparisons reliably infer region affiliation and correct SAM3 over-/under-segmentation for implicit expressions lacks supporting error analysis, failure cases, or quantitative ablation isolating MSR's correction rate. This assumption is load-bearing because visual similarity in DINOv3 embeddings may not align with linguistic intent in complex implicit cases, potentially introducing new mismatches rather than fixing them.

    Authors: This is a valid point; while the manuscript includes ablation studies validating each phase overall, it does not provide a dedicated quantitative isolation of MSR's correction rate, error analysis, or failure cases for implicit expressions. We will revise the paper to add a focused ablation measuring MSR's improvement (e.g., percentage of over-/under-segmented masks corrected), representative failure cases with analysis of when DINOv3 similarity may diverge from linguistic intent, and a balanced discussion of limitations. This will directly address the load-bearing assumption. revision: yes

  3. Referee: [Expression Reasoning Interpreter (ERI) phase] The description of how 'reasoning-assisted prompt options' and 'evaluation-aware rephrasing' transform arbitrary queries into robust heterogeneous prompts for SAM3 provides no implementation details, pseudocode, or examples of the reasoning process. This is necessary to evaluate the reliability of the initial masks on which MSR depends.

    Authors: We agree that additional implementation transparency is needed for the ERI phase to support evaluation and reproducibility. The current description is high-level; in the revised manuscript, we will include pseudocode for the full ERI pipeline, concrete examples of reasoning-assisted prompt options and evaluation-aware rephrasing applied to sample expressions (both explicit and implicit), and details on how these produce heterogeneous inputs for SAM3. This will clarify the process and allow assessment of initial mask reliability. revision: yes

Circularity Check

0 steps flagged

No circularity: training-free procedural pipeline on external models

full rationale

The paper describes a two-phase framework (ERI for prompt parsing/rephrasing and MSR for DINOv3-based mask refinement) that operates as a procedural composition of pre-trained external models (SAM3, DINOv3, MLLM). No equations, parameter fitting, self-definitional loops, or derivations appear; performance claims rest on empirical benchmark results rather than any reduction of outputs to inputs by construction. No load-bearing self-citations or ansatzes are invoked in the abstract or described method.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on untested assumptions about SAM3 prompt robustness and DINOv3 feature discriminability; no free parameters are introduced because the method is training-free, but two new procedural modules are postulated without external validation.

axioms (2)
  • domain assumption SAM3 produces reliable initial masks when given heterogeneous prompts derived from expression reasoning
    Invoked in the ERI phase description
  • domain assumption DINOv3 features allow accurate inference of region affiliation to correct over- and under-segmentation
    Invoked in the MSR phase description
invented entities (2)
  • Expression Reasoning Interpreter (ERI) no independent evidence
    purpose: Structured parsing and evaluation-aware rephrasing of arbitrary queries into robust prompts
    New component introduced to bridge language to SAM3
  • Mask Self-Refining (MSR) no independent evidence
    purpose: Best-mask selection and feature-based correction of segmentation errors
    New component introduced to refine SAM3 outputs

pith-pipeline@v0.9.0 · 5617 in / 1326 out tokens · 42098 ms · 2026-05-10T18:06:21.743086+00:00 · methodology



Reference graph

Works this paper leans on

68 extracted references · 16 canonical work pages · 3 internal anchors

  [1] An, X., Yang, K., Dai, X., Feng, Z., Deng, J.: Multi-label cluster discrimination for visual representation learning. In: European Conference on Computer Vision. pp. 428–444. Springer (2024)
  [2] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
  [3] Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation. In: European Conference on Computer Vision. pp. 323–340. Springer (2024)
  [4] Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation. arXiv preprint arXiv:2409.10542 (2024)
  [5] Ding, H., Liu, C., Wang, S., Jiang, X.: VLT: Vision-language transformer and query generation for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7900–7916 (2022)
  [6] Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q.H., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., Florence, P.R.: PaLM-E: An embodied multimodal language model. In: International Conference on Machine Learning (2023)
  [7] Du, T., Li, H., Fan, Z., Zhang, J., Pan, P., Zhang, Y.: SAM-Veteran: An MLLM-based human-like SAM agent for reasoning segmentation. In: The Fourteenth International Conference on Learning Representations
  [8] Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: European Conference on Computer Vision. pp. 108–124. Springer (2016)
  [9] Hu, Y., Wang, Q., Shao, W., Xie, E., Li, Z., Han, J., Luo, P.: Beyond one-to-one: Rethinking the referring image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4067–4077 (October 2023)
  [10] Hu, Z., Feng, G., Sun, J., Zhang, L., Lu, H.: Bi-directional relationship inferring network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4424–4433 (2020)
  [11] Huang, J., Xu, Z., Liu, T., Liu, Y., Han, H., Yuan, K., Li, X.: Densely connected parameter-efficient tuning for referring image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 3653–3661 (2025)
  [12] Huang, S., Hui, T., Liu, S., Li, G., Wei, Y., Han, J., Liu, L., Li, B.: Referring image segmentation via cross-modal progressive comprehension. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10488–10497 (2020)
  [13] Jiang, C., Ding, T., Song, C., Tu, J., Yan, Z., Shao, Y., Wang, Z., Shang, Y., Han, T., Tian, Y.: Medical SAM3: A foundation model for universal prompt-driven medical image segmentation. arXiv preprint arXiv:2601.10880 (2026)
  [14] Jing, Y., Kong, T., Wang, W., Wang, L., Li, L., Tan, T.: Locate then segment: A strong pipeline for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9858–9867 (2021)
  [15] Kim, N., Kim, D., Lan, C., Zeng, W., Kwak, S.: ReSTR: Convolution-free referring image segmentation using transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18145–18154 (2022)
  [16] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: LISA: Reasoning segmentation via large language model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9579–9589 (2023)
  [17] Li, J., Xie, Q., Gu, R., Xu, J., Liu, Y., Yu, X.: LGD: Leveraging generative descriptions for zero-shot referring image segmentation. Pattern Recognition, p. 112549 (2025)
  [18] Li, J., Feng, Y., Guo, Y., Huang, J., Piao, Y., Bi, Q., Zhang, M., Zhao, X., Chen, Q., Zou, S., Ji, W., Lu, H., Cheng, L.: SAM3-I: Segment anything with instructions. arXiv preprint arXiv:2512.04585 (2025)
  [19] Li, R., Li, K., Kuo, Y.C., Shu, M., Qi, X., Shen, X., Jia, J.: Referring image segmentation via recurrent refinement networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5745–5753 (2018)
  [20] Liang, C., Wang, W., Zhou, T., Miao, J., Luo, Y., Yang, Y.: Local-global context aware transformer for language-guided video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(8), 10055–10069 (2023)
  [21] Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7061–7070 (2023)
  [22] Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., Yuille, A.: Recurrent multimodal interaction for referring image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1271–1280 (2017)
  [23] Liu, S., Hui, T., Huang, S., Wei, Y., Li, B., Li, G.: Cross-modal progressive comprehension for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 4761–4775 (2021)
  [24] Liu, T., Li, S.: Hybrid global-local representation with augmented spatial guidance for zero-shot referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 29634–29643 (2025)
  [25] Liu, Y., Ma, M., Yu, X., Ding, P., Zhao, H., Sun, M., Huang, S., Wang, D.: SSR: Enhancing depth perception in vision-language models via rationale-guided spatial reasoning. arXiv preprint arXiv:2505.12448 (2025)
  [26] Liu, Y., Zhang, C., Wang, Y., Wang, J., Yang, Y., Tang, Y.: Universal segmentation at arbitrary granularity with language instruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3459–3469 (2024)
  [27] Liu, Y., Peng, B., Zhong, Z., Yue, Z., Lu, F., Yu, B., Jia, J.: Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520 (2025)
  [28] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In: International Conference on Learning Representations (2023)
  [29] Lu, Y., Cao, J., Wu, Y., Li, B., Tang, L., Ji, Y., Wu, C., Wu, J., Zhu, W.: RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 14699–14716 (2025)
  [30] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11–20 (2016)
  [31] Ni, M., Zhang, Y., Feng, K., Li, X., Guo, Y., Zuo, W.: Ref-Diff: Zero-shot referring image segmentation with generative models. arXiv preprint arXiv:2308.16777 (2023)
  [32] Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how <SEG> token works. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24722–24731 (2025)
  [33] Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: GLaMM: Pixel grounding large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13009–13018 (2024)
  [34] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)
  [35] Sapkota, R., Karkee, M.: Object detection with multimodal large vision-language models: An in-depth review. Available at SSRN 5233953 (2025)
  [36] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025)
  [37] Song, J., Hua, Z., Zan, H., Han, Y., Peng, M.: Optimizing discriminative vision-language models for efficient multimodal intent recognition. In: Companion Proceedings of the ACM on Web Conference 2025 (2025)
  [38] Sun, S., Li, R., Torr, P., Gu, X., Li, S.: CLIP as RNN: Segment countless visual concepts without training endeavor. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13171–13182 (2024)
  [39] Suo, Y., Zhu, L., Yang, Y.: Text augmented spatial aware zero-shot referring image segmentation. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 1032–1043 (2023)
  [40] Wang, H., Qiao, L., Jie, Z., Huang, Z., Feng, C., Zheng, Q., Ma, L., Lan, X., Liang, X.: X-SAM: From segment anything to any segmentation. arXiv preprint arXiv:2508.04655 (2025)
  [41] Wang, J., Ke, L.: LLM-Seg: Bridging image segmentation and large language model reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1765–1774 (2024)
  [42] Wang, S., Kim, D., Taalimi, A., Sun, C., Kuo, W.: Learning visual grounding from generative vision and language model. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 8057–8067. IEEE (2025)
  [43] Wang, W., Yue, T., Zhang, Y., Guo, L., He, X., Wang, X., Liu, J.: Unveiling parts beyond objects: Towards finer-granularity referring expression segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12998–13008 (2024)
  [44] Wang, Y., Ni, J., Liu, Y., Yuan, C., Tang, Y.: IterPrime: Zero-shot referring image segmentation with iterative Grad-CAM refinement and primary word emphasis. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 8159–8168 (2025)
  [45] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: CRIS: CLIP-driven referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11686–11695 (2022)
  [46] Wei, C., Zhong, Y., Tan, H., Liu, Y., Zhao, Z., Hu, J., Yang, Y.: HyperSeg: Towards universal visual segmentation with large language model. arXiv preprint arXiv:2411.17606 (2024)
  [47] Wu, J., Jiang, Y., Liu, Q., Yuan, Z., Bai, X., Bai, S.: General object foundation model for images and videos at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3783–3795 (2024)
  [48] Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: GSVA: Generalized segmentation via multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3858–3869 (2024)
  [49] Xie, Y., Yang, K., An, X., Wu, K., Zhao, Y., Deng, W., Ran, Z., Wang, Y., Feng, Z., Miles, R., et al.: Region-based cluster discrimination for visual representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1793–1803 (2025)
  [50] Xiong, X., Wu, Z., Lu, L., Xia, Y.: SAM3-UNet: Simplified adaptation of Segment Anything Model 3. arXiv preprint arXiv:2512.01789 (2025)
  [51] Xu, Z., Chen, Z., Zhang, Y., Song, Y., Wan, X., Li, G.: Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17503–17512 (2023)
  [52] Yang, S., Qu, T., Lai, X., Tian, Z., Peng, B., Liu, S., Jia, J.: LISA++: An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240 (2023)
  [53] Yang, S., Wang, Y., Chen, K., Zeng, W., Fei, Z.: Attribute-aware feature encoding for object recognition and segmentation. IEEE Transactions on Multimedia 24, 3611–3623 (2021)
  [54] Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: LAVT: Language-aware vision transformer for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18155–18165 (2022)
  [55] Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
  [56] Ye, K., You, X., Lin, J., Ji, J., Dai, P., Cao, L.: Evolving, not training: Zero-shot reasoning segmentation via evolutionary prompting. arXiv preprint arXiv:2512.24702 (2025)
  [57] Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10502–10511 (2019)
  [58] Yi, Z., Ouyang, J., Xu, Z., Liu, Y., Liao, T., Luo, H., Shen, Y.: A survey on recent advances in LLM-based multi-turn dialogue systems. ACM Computing Surveys (2024)
  [59] Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: MAttNet: Modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1307–1315 (2018)
  [60] Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: European Conference on Computer Vision. pp. 69–85. Springer (2016)
  [61] Yu, S., Seo, P.H., Son, J.: Zero-shot referring image segmentation with global-local context features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19456–19465 (2023)
  [62] Yu, S., Seo, P.H., Son, J.: Pseudo-RIS: Distinctive pseudo-supervision generation for referring image segmentation. In: European Conference on Computer Vision. pp. 18–36. Springer (2024)
  [63] Zhang, Y., Cheng, T., Zhu, L., Hu, R., Liu, L., Liu, H., Ran, L., Chen, X., Liu, W., Wang, X.: EVF-SAM: Early vision-language fusion for text-prompted segment anything model. arXiv preprint arXiv:2406.20076 (2024)
  [64] Zhang, Z., Ma, Y., Zhang, E., Bai, X.: PSALM: Pixelwise segmentation with large multi-modal model. In: European Conference on Computer Vision. pp. 74–91. Springer (2024)
  [65] Zheng, D., Huang, S., Zhao, L., Zhong, Y., Wang, L.: Towards learning a generalist model for embodied navigation. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13624–13634 (2023)
  [66] Zhu, L., Ouyang, B., Zhang, Y., Cheng, T., Hu, R., Shen, H., Ran, L., Chen, X., Yu, L., Liu, W., et al.: LENS: Learning to segment anything with unified reinforced reasoning. arXiv preprint arXiv:2508.14153 (2025)
  [67] Zhu, M., Tian, Y., Chen, H., Zhou, C., Guo, Q., Liu, Y., Yang, M., Shen, C.: SegAgent: Exploring pixel understanding capabilities in MLLMs by imitating human annotator trajectories. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3686–3696 (2025)
  [68] Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. Advances in Neural Information Processing Systems 36, 19769–19782 (2023)