pith. machine review for the scientific record.

arxiv: 2604.07916 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 theorem links

Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation


Pith reviewed 2026-05-10 18:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords referring expression segmentation · SAM3 · training-free · prompt generation · mask refinement · DINOv3 · open-world segmentation · expression parsing

The pith

Tarot-SAM3 turns any natural-language referring expression into reliable image masks by feeding structured prompts to SAM3 and then refining the results with DINOv3 feature comparisons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tarot-SAM3, a training-free framework that lets the Segment Anything Model 3 handle referring expression segmentation for arbitrary text descriptions. Existing approaches need large annotated datasets and often fail on implicit or complex expressions, while simply pairing SAM3 with a multimodal language model leaves outputs unrefined. Tarot-SAM3 first uses an Expression Reasoning Interpreter to parse and rephrase expressions into heterogeneous prompts that produce initial SAM3 masks, then applies Mask Self-Refining to pick the best mask and correct errors by comparing discriminative regions with DINOv3 features. A sympathetic reader would care because this removes the need for task-specific training data and extends usable segmentation to open-world cases.

Core claim

Tarot-SAM3 is a training-free framework with two phases. The Expression Reasoning Interpreter (ERI) phase applies reasoning-assisted prompt options for structured parsing and evaluation-aware rephrasing, converting arbitrary queries into robust heterogeneous prompts for SAM3 mask generation. The Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement, leveraging DINOv3 feature relationships to compare regions and infer target affiliation, thereby correcting over- and under-segmentation.
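
Read as control flow, the claim amounts to a small two-stage loop. The sketch below is only an illustration of that structure, not the authors' code: every interface in it (mllm_parse, mllm_rephrase, sam3_segment, dinov3_feat) is a hypothetical stand-in, since the review summarizes the method without implementation details.

```python
# Minimal sketch of the two-phase Tarot-SAM3 control flow described above.
# Model calls are passed in as hypothetical callables; treat this as a
# reading aid for the claim's structure, not the paper's implementation.
import numpy as np

def expression_reasoning_interpreter(expression, mllm_parse, mllm_rephrase):
    """ERI phase: turn one arbitrary expression into heterogeneous prompts."""
    return [
        ("text", expression),                      # raw query as-is
        ("noun_phrase", mllm_parse(expression)),   # structured parse (assumed API)
        ("text", mllm_rephrase(expression)),       # evaluation-aware rephrase (assumed API)
    ]

def mask_self_refining(image, masks, dinov3_feat):
    """MSR phase: select the most consensual candidate via DINOv3 region
    features; a stand-in for the paper's affiliation-based correction."""
    feats = np.stack([dinov3_feat(image, m) for m in masks])  # (N, D) region embeddings
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T                                    # pairwise cosine similarity
    best = int(np.argmax(sims.sum(axis=1)))                   # mask most similar to the rest
    return masks[best]  # the real MSR would also grow/shrink this mask by affiliation

def tarot_sam3(image, expression, mllm_parse, mllm_rephrase, sam3_segment, dinov3_feat):
    prompts = expression_reasoning_interpreter(expression, mllm_parse, mllm_rephrase)
    masks = [sam3_segment(image, kind, value) for kind, value in prompts]
    return mask_self_refining(image, masks, dinov3_feat)
```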

What carries the argument

The two-phase Tarot-SAM3 process in which the Expression Reasoning Interpreter converts expressions into heterogeneous SAM3 prompts and the Mask Self-Refining stage uses DINOv3 feature comparisons to select and correct masks.

Load-bearing premise

The Expression Reasoning Interpreter can reliably turn arbitrary expressions into prompts that produce usable SAM3 initial masks, and DINOv3 feature comparisons can correctly identify region affiliation without introducing new errors.
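
To see what the second half of this premise commits to, here is one plausible shape for the affiliation test — a hedged sketch rather than the paper's procedure. The feature pooling, the cosine test, and the threshold tau are all assumptions introduced for illustration; the review does not report how the paper makes this decision.

```python
# Hedged sketch of the load-bearing DINOv3 premise: decide whether a candidate
# region belongs to the target by cosine similarity of pooled patch features.
# `patch_features` is an (H, W, D) DINOv3 feature map; masks are boolean (H, W).
# The threshold tau is invented here for illustration only.
import numpy as np

def pooled_feature(patch_features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the feature vectors of all patches inside the mask, L2-normalized."""
    vec = patch_features[mask].mean(axis=0)
    return vec / np.linalg.norm(vec)

def is_affiliated(patch_features, target_mask, candidate_region, tau=0.6):
    """True if the candidate region's pooled feature is close to the target's."""
    t = pooled_feature(patch_features, target_mask)
    c = pooled_feature(patch_features, candidate_region)
    return float(t @ c) >= tau

def correct_mask(patch_features, mask, candidate_regions, tau=0.6):
    """Grow the mask with affiliated regions (fixing under-segmentation) and
    drop unaffiliated overlapping parts (fixing over-segmentation)."""
    out = mask.copy()
    for region in candidate_regions:
        if is_affiliated(patch_features, mask, region, tau):
            out |= region          # under-segmentation: add the missing region
        elif (region & mask).any():
            out &= ~region         # over-segmentation: cut the spurious overlap
    return out
```

The premise fails exactly where this sketch would: when visually similar patches (high cosine similarity) are not the linguistically intended target, the correction step adds errors instead of removing them.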

What would settle it

A benchmark set of complex implicit expressions where the ERI-generated masks remain poor after refinement and overall accuracy falls below that of trained RES models.
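
Operationally, settling it reduces to running both families of methods over such a benchmark and comparing mean IoU. A minimal scorer, assuming binary masks and a predict(image, expression) callable, could look like the following; the benchmark itself (a curated set of complex implicit expressions) is the part that does not yet exist.

```python
# Minimal mIoU scorer for the decisive benchmark described above.
# `predict` maps (image, expression) -> boolean mask; ground-truth masks are
# boolean arrays of the same shape. Purely illustrative.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # empty-vs-empty counts as perfect

def mean_iou(predict, samples) -> float:
    """samples: iterable of (image, expression, gt_mask) triples."""
    return float(np.mean([iou(predict(img, expr), gt) for img, expr, gt in samples]))

# The core claim would be refuted if, on such a hard implicit set:
#   mean_iou(tarot_sam3_predict, hard_set) < mean_iou(trained_res_predict, hard_set)
```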

Figures

Figures reproduced from arXiv: 2604.07916 by Dingwen Xiao, Guangyu Xiang, Lei Chen, Lin Wang, Minwei Zhao, Shiqi Wen, Songyue Guo, Weiming Zhang.

Figure 1. Tarot-SAM3 achieves reliable referring segmentation for both explicit (top row) and implicit (middle row) reasoning queries. The bottom row highlights that Tarot-SAM3 breaks the strong reliance of prior methods [2,56] on direct MLLM parsing (left) and naive SAM3 text-mask predictions (right).
Figure 2. Overview of the Tarot-SAM3 framework.
Figure 3. Example visualizations on explicit RES benchmarks (RES, ReasonSeg Short, ReasonSeg Long), comparing SAM3-Agent, Evol-SAM, Tarot-SAM3, and ground truth on queries such as "the part that can transmit traction and braking torque, coming into contact with the ground" and "unopened flower bud".
Figure 5. Example visualizations of the sequential application of the two phases (ERI, then MSR).
Figure 6. Visual comparisons of an augmented text prompt. The initial query "the person who is speaking currently" is rephrased via Eq. (5) into "The woman holding the microphone"; MLLM bounding boxes and SAM3 masks are shown for both queries against the ground truth.
Figure 8. Open-world visual comparisons: (A) ambiguous spatial understanding, query "far left crate"; (B) misunderstanding about target level, query "the region exhibiting unusual color". Tarot-SAM3 is shown against ground truth.
Original abstract

Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Tarot-SAM3, a training-free framework for referring expression segmentation (RES) applicable to any expression. It comprises two phases: the Expression Reasoning Interpreter (ERI), which parses arbitrary natural-language expressions via reasoning-assisted prompts and evaluation-aware rephrasing to produce heterogeneous inputs for SAM3 mask generation; and the Mask Self-Refining (MSR) phase, which selects the best mask across prompt variants and applies DINOv3 feature comparisons to infer target region affiliation, thereby correcting over- and under-segmentation. The central claim is that this pipeline achieves strong performance on explicit and implicit RES benchmarks as well as open-world scenarios without any training or annotated data.

Significance. If the empirical claims hold, the work is significant as a training-free, generalizable alternative to dataset-dependent RES methods and naive SAM3+MLLM couplings. It explicitly leverages pre-trained models (SAM3 for promptable segmentation and DINOv3 for feature-based refinement) in a procedural pipeline, avoiding the need for large annotated datasets or fine-tuning. This addresses a clear limitation in current RES literature regarding generalization to implicit and open-world expressions.

major comments (3)
  1. [Abstract] The assertion that Tarot-SAM3 'achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios' is not accompanied by any quantitative metrics, baseline comparisons, or ablation numbers. This is load-bearing for the central claim, as the effectiveness of ERI prompt generation and MSR correction cannot be assessed without these results.
  2. [Mask Self-Refining (MSR) phase] The claim that DINOv3 feature comparisons reliably infer region affiliation and correct SAM3 over-/under-segmentation for implicit expressions lacks supporting error analysis, failure cases, or quantitative ablation isolating MSR's correction rate. This assumption is load-bearing because visual similarity in DINOv3 embeddings may not align with linguistic intent in complex implicit cases, potentially introducing new mismatches rather than fixing them.
  3. [Expression Reasoning Interpreter (ERI) phase] The description of how 'reasoning-assisted prompt options' and 'evaluation-aware rephrasing' transform arbitrary queries into robust heterogeneous prompts for SAM3 provides no implementation details, pseudocode, or examples of the reasoning process. This is necessary to evaluate the reliability of the initial masks on which MSR depends.
minor comments (2)
  1. [Abstract] The acronym 'SAM3' is used before its expansion as 'Segment Anything Model 3' in the abstract; define all acronyms at first use.
  2. [Abstract] The abstract contains several long compound sentences that reduce readability; consider splitting for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have carefully addressed each major comment below, providing clarifications and committing to specific revisions that will strengthen the paper's clarity, reproducibility, and empirical support without misrepresenting our contributions.

Point-by-point responses
  1. Referee: [Abstract] The assertion that Tarot-SAM3 'achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios' is not accompanied by any quantitative metrics, baseline comparisons, or ablation numbers. This is load-bearing for the central claim, as the effectiveness of ERI prompt generation and MSR correction cannot be assessed without these results.

    Authors: We agree that the abstract should include concrete quantitative support for the performance claims. Although the full manuscript contains extensive experimental results with metrics, baselines, and ablations, these are not summarized in the abstract. In the revised version, we will update the abstract to report key mIoU and other metrics on explicit/implicit benchmarks and open-world scenarios, along with brief baseline comparisons and ablation highlights. This will make the central claims more self-contained and evidence-based. revision: yes

  2. Referee: [Mask Self-Refining (MSR) phase] The claim that DINOv3 feature comparisons reliably infer region affiliation and correct SAM3 over-/under-segmentation for implicit expressions lacks supporting error analysis, failure cases, or quantitative ablation isolating MSR's correction rate. This assumption is load-bearing because visual similarity in DINOv3 embeddings may not align with linguistic intent in complex implicit cases, potentially introducing new mismatches rather than fixing them.

    Authors: This is a valid point; while the manuscript includes ablation studies validating each phase overall, it does not provide a dedicated quantitative isolation of MSR's correction rate, error analysis, or failure cases for implicit expressions. We will revise the paper to add a focused ablation measuring MSR's improvement (e.g., percentage of over-/under-segmented masks corrected), representative failure cases with analysis of when DINOv3 similarity may diverge from linguistic intent, and a balanced discussion of limitations. This will directly address the load-bearing assumption. revision: yes

  3. Referee: [Expression Reasoning Interpreter (ERI) phase] The description of how 'reasoning-assisted prompt options' and 'evaluation-aware rephrasing' transform arbitrary queries into robust heterogeneous prompts for SAM3 provides no implementation details, pseudocode, or examples of the reasoning process. This is necessary to evaluate the reliability of the initial masks on which MSR depends.

    Authors: We agree that additional implementation transparency is needed for the ERI phase to support evaluation and reproducibility. The current description is high-level; in the revised manuscript, we will include pseudocode for the full ERI pipeline, concrete examples of reasoning-assisted prompt options and evaluation-aware rephrasing applied to sample expressions (both explicit and implicit), and details on how these produce heterogeneous inputs for SAM3. This will clarify the process and allow assessment of initial mask reliability. revision: yes

Circularity Check

0 steps flagged

No circularity: training-free procedural pipeline on external models

full rationale

The paper describes a two-phase framework (ERI for prompt parsing/rephrasing and MSR for DINOv3-based mask refinement) that operates as a procedural composition of pre-trained external models (SAM3, DINOv3, MLLM). No equations, parameter fitting, self-definitional loops, or derivations appear; performance claims rest on empirical benchmark results rather than any reduction of outputs to inputs by construction. No load-bearing self-citations or ansatzes are invoked in the abstract or described method.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on untested assumptions about SAM3 prompt robustness and DINOv3 feature discriminability; no free parameters are introduced because the method is training-free, but two new procedural modules are postulated without external validation.

axioms (2)
  • domain assumption SAM3 produces reliable initial masks when given heterogeneous prompts derived from expression reasoning
    Invoked in the ERI phase description
  • domain assumption DINOv3 features allow accurate inference of region affiliation to correct over- and under-segmentation
    Invoked in the MSR phase description
invented entities (2)
  • Expression Reasoning Interpreter (ERI) no independent evidence
    purpose: Structured parsing and evaluation-aware rephrasing of arbitrary queries into robust prompts
    New component introduced to bridge language to SAM3
  • Mask Self-Refining (MSR) no independent evidence
    purpose: Best-mask selection and feature-based correction of segmentation errors
    New component introduced to refine SAM3 outputs

pith-pipeline@v0.9.0 · 5617 in / 1326 out tokens · 42098 ms · 2026-05-10T18:06:21.743086+00:00 · methodology



Reference graph

Works this paper leans on

68 extracted references · 16 canonical work pages · 3 internal anchors

  [1] An, X., Yang, K., Dai, X., Feng, Z., Deng, J.: Multi-label cluster discrimination for visual representation learning. In: European Conference on Computer Vision. pp. 428–444. Springer (2024)
  [2] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
  [3] Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation. In: European Conference on Computer Vision. pp. 323–340. Springer (2024)
  [4] Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation. arXiv preprint arXiv:2409.10542 (2024)
  [5] Ding, H., Liu, C., Wang, S., Jiang, X.: VLT: Vision-language transformer and query generation for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7900–7916 (2022)
  [6] Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q.H., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., Florence, P.R.: PaLM-E: An embodied multimodal language model. In: International Conference on Machine Learning (2023)
  [7] Du, T., Li, H., Fan, Z., Zhang, J., Pan, P., Zhang, Y.: SAM-Veteran: An MLLM-based human-like SAM agent for reasoning segmentation. In: The Fourteenth International Conference on Learning Representations
  [8] Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: European Conference on Computer Vision. pp. 108–124. Springer (2016)
  [9] Hu, Y., Wang, Q., Shao, W., Xie, E., Li, Z., Han, J., Luo, P.: Beyond one-to-one: Rethinking the referring image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4067–4077 (October 2023)
  [10] Hu, Z., Feng, G., Sun, J., Zhang, L., Lu, H.: Bi-directional relationship inferring network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4424–4433 (2020)
  [11] Huang, J., Xu, Z., Liu, T., Liu, Y., Han, H., Yuan, K., Li, X.: Densely connected parameter-efficient tuning for referring image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 3653–3661 (2025)
  [12] Huang, S., Hui, T., Liu, S., Li, G., Wei, Y., Han, J., Liu, L., Li, B.: Referring image segmentation via cross-modal progressive comprehension. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10488–10497 (2020)
  [13] Jiang, C., Ding, T., Song, C., Tu, J., Yan, Z., Shao, Y., Wang, Z., Shang, Y., Han, T., Tian, Y.: Medical SAM3: A foundation model for universal prompt-driven medical image segmentation. arXiv preprint arXiv:2601.10880 (2026)
  [14] Jing, Y., Kong, T., Wang, W., Wang, L., Li, L., Tan, T.: Locate then segment: A strong pipeline for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9858–9867 (2021)
  [15] Kim, N., Kim, D., Lan, C., Zeng, W., Kwak, S.: ReSTR: Convolution-free referring image segmentation using transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18145–18154 (2022)
  [16] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: LISA: Reasoning segmentation via large language model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9579–9589 (2023)
  [17] Li, J., Xie, Q., Gu, R., Xu, J., Liu, Y., Yu, X.: LGD: Leveraging generative descriptions for zero-shot referring image segmentation. Pattern Recognition, p. 112549 (2025)
  [18] Li, J., Feng, Y., Guo, Y., Huang, J., Piao, Y., Bi, Q., Zhang, M., Zhao, X., Chen, Q., Zou, S., Ji, W., Lu, H., Cheng, L.: SAM3-I: Segment anything with instructions. arXiv preprint arXiv:2512.04585 (2025)
  [19] Li, R., Li, K., Kuo, Y.C., Shu, M., Qi, X., Shen, X., Jia, J.: Referring image segmentation via recurrent refinement networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5745–5753 (2018)
  [20] Liang, C., Wang, W., Zhou, T., Miao, J., Luo, Y., Yang, Y.: Local-global context aware transformer for language-guided video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(8), 10055–10069 (2023)
  [21] Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7061–7070 (2023)
  [22] Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., Yuille, A.: Recurrent multimodal interaction for referring image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1271–1280 (2017)
  [23] Liu, S., Hui, T., Huang, S., Wei, Y., Li, B., Li, G.: Cross-modal progressive comprehension for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 4761–4775 (2021)
  [24] Liu, T., Li, S.: Hybrid global-local representation with augmented spatial guidance for zero-shot referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 29634–29643 (2025)
  [25] Liu, Y., Ma, M., Yu, X., Ding, P., Zhao, H., Sun, M., Huang, S., Wang, D.: SSR: Enhancing depth perception in vision-language models via rationale-guided spatial reasoning. arXiv preprint arXiv:2505.12448 (2025)
  [26] Liu, Y., Zhang, C., Wang, Y., Wang, J., Yang, Y., Tang, Y.: Universal segmentation at arbitrary granularity with language instruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3459–3469 (2024)
  [27] Liu, Y., Peng, B., Zhong, Z., Yue, Z., Lu, F., Yu, B., Jia, J.: Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520 (2025)
  [28] Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In: International Conference on Learning Representations (2023)
  [29] Lu, Y., Cao, J., Wu, Y., Li, B., Tang, L., Ji, Y., Wu, C., Wu, J., Zhu, W.: RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 14699–14716 (2025)
  [30] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11–20 (2016)
  [31] Ni, M., Zhang, Y., Feng, K., Li, X., Guo, Y., Zuo, W.: Ref-Diff: Zero-shot referring image segmentation with generative models. arXiv preprint arXiv:2308.16777 (2023)
  [32] Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how <SEG> token works. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24722–24731 (2025)
  [33] Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: GLaMM: Pixel grounding large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13009–13018 (2024)
  [34] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)
  [35] Sapkota, R., Karkee, M.: Object detection with multimodal large vision-language models: An in-depth review. Available at SSRN 5233953 (2025)
  [36] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025)
  [37] Song, J., Hua, Z., Zan, H., Han, Y., Peng, M.: Optimizing discriminative vision-language models for efficient multimodal intent recognition. In: Companion Proceedings of the ACM on Web Conference 2025 (2025)
  [38] Sun, S., Li, R., Torr, P., Gu, X., Li, S.: CLIP as RNN: Segment countless visual concepts without training endeavor. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13171–13182 (2024)
  [39] Suo, Y., Zhu, L., Yang, Y.: Text augmented spatial aware zero-shot referring image segmentation. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 1032–1043 (2023)
  [40] Wang, H., Qiao, L., Jie, Z., Huang, Z., Feng, C., Zheng, Q., Ma, L., Lan, X., Liang, X.: X-SAM: From segment anything to any segmentation. arXiv preprint arXiv:2508.04655 (2025)
  [41] Wang, J., Ke, L.: LLM-Seg: Bridging image segmentation and large language model reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1765–1774 (2024)
  [42] Wang, S., Kim, D., Taalimi, A., Sun, C., Kuo, W.: Learning visual grounding from generative vision and language model. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 8057–8067. IEEE (2025)
  [43] Wang, W., Yue, T., Zhang, Y., Guo, L., He, X., Wang, X., Liu, J.: Unveiling parts beyond objects: Towards finer-granularity referring expression segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12998–13008 (2024)
  [44] Wang, Y., Ni, J., Liu, Y., Yuan, C., Tang, Y.: IterPrime: Zero-shot referring image segmentation with iterative Grad-CAM refinement and primary word emphasis. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 8159–8168 (2025)
  [45] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: CRIS: CLIP-driven referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11686–11695 (2022)
  [46] Wei, C., Zhong, Y., Tan, H., Liu, Y., Zhao, Z., Hu, J., Yang, Y.: HyperSeg: Towards universal visual segmentation with large language model. arXiv preprint arXiv:2411.17606 (2024)
  [47] Wu, J., Jiang, Y., Liu, Q., Yuan, Z., Bai, X., Bai, S.: General object foundation model for images and videos at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3783–3795 (2024)
  [48] Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: GSVA: Generalized segmentation via multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3858–3869 (2024)
  [49] Xie, Y., Yang, K., An, X., Wu, K., Zhao, Y., Deng, W., Ran, Z., Wang, Y., Feng, Z., Miles, R., et al.: Region-based cluster discrimination for visual representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1793–1803 (2025)
  [50] Xiong, X., Wu, Z., Lu, L., Xia, Y.: SAM3-UNet: Simplified adaptation of Segment Anything Model 3. arXiv preprint arXiv:2512.01789 (2025)
  [51] Xu, Z., Chen, Z., Zhang, Y., Song, Y., Wan, X., Li, G.: Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17503–17512 (2023)
  [52] Yang, S., Qu, T., Lai, X., Tian, Z., Peng, B., Liu, S., Jia, J.: LISA++: An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240 (2023)
  [53] Yang, S., Wang, Y., Chen, K., Zeng, W., Fei, Z.: Attribute-aware feature encoding for object recognition and segmentation. IEEE Transactions on Multimedia 24, 3611–3623 (2021)
  [54] Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: LAVT: Language-aware vision transformer for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18155–18165 (2022)
  [55] Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
  [56] Ye, K., You, X., Lin, J., Ji, J., Dai, P., Cao, L.: Evolving, not training: Zero-shot reasoning segmentation via evolutionary prompting. arXiv preprint arXiv:2512.24702 (2025)
  [57] Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10502–10511 (2019)
  [58] Yi, Z., Ouyang, J., Xu, Z., Liu, Y., Liao, T., Luo, H., Shen, Y.: A survey on recent advances in LLM-based multi-turn dialogue systems. ACM Computing Surveys (2024)
  [59] Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: MAttNet: Modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1307–1315 (2018)
  [60] Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: European Conference on Computer Vision. pp. 69–85. Springer (2016)
  [61] Yu, S., Seo, P.H., Son, J.: Zero-shot referring image segmentation with global-local context features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19456–19465 (2023)
  [62] Yu, S., Seo, P.H., Son, J.: Pseudo-RIS: Distinctive pseudo-supervision generation for referring image segmentation. In: European Conference on Computer Vision. pp. 18–36. Springer (2024)
  [63] Zhang, Y., Cheng, T., Zhu, L., Hu, R., Liu, L., Liu, H., Ran, L., Chen, X., Liu, W., Wang, X.: EVF-SAM: Early vision-language fusion for text-prompted segment anything model. arXiv preprint arXiv:2406.20076 (2024)
  [64] Zhang, Z., Ma, Y., Zhang, E., Bai, X.: PSALM: Pixelwise segmentation with large multi-modal model. In: European Conference on Computer Vision. pp. 74–91. Springer (2024)
  [65] Zheng, D., Huang, S., Zhao, L., Zhong, Y., Wang, L.: Towards learning a generalist model for embodied navigation. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13624–13634 (2023)
  [66] Zhu, L., Ouyang, B., Zhang, Y., Cheng, T., Hu, R., Shen, H., Ran, L., Chen, X., Yu, L., Liu, W., et al.: LENS: Learning to segment anything with unified reinforced reasoning. arXiv preprint arXiv:2508.14153 (2025)
  [67] Zhu, M., Tian, Y., Chen, H., Zhou, C., Guo, Q., Liu, Y., Yang, M., Shen, C.: SegAgent: Exploring pixel understanding capabilities in MLLMs by imitating human annotator trajectories. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3686–3696 (2025)
  [68] Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. Advances in Neural Information Processing Systems 36, 19769–19782 (2023)