pith. machine review for the scientific record.

arxiv: 2605.09591 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: no theorem link

From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords promptable segmentation · concept-faithful segmentation · counterfactual evaluation · semantic grounding · benchmark · SAM

The pith

Promptable segmentation models often output accurate masks even for misleading prompts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether promptable segmentation models understand the concepts described in text prompts or merely respond to visual patterns in images. It builds the CAFE benchmark by taking original images, keeping the target object mask fixed, and editing attributes to create prompts that point to something else. Tests on multiple models show they often still produce high-quality masks for these misleading prompts. This finding indicates that current ways of measuring segmentation success overlook whether the model is actually following the concept in the prompt. If true, it means many applications depending on precise prompt-based control could produce unreliable results when descriptions are not perfect.

Core claim

The central finding is a systematic gap between how well models localize objects and how well they discriminate the actual concepts in prompts. This is shown through the CAFE benchmark, consisting of 2,146 paired samples across three types of counterfactual attribute changes: superficial mimicry, context conflict, and ontological conflict. Models generate accurate masks for negative prompts in many cases, showing that mask accuracy alone does not confirm faithful concept grounding.
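Under this paired design, the gap can be read directly off mean mask quality under positive versus misleading prompts. A minimal sketch, not the paper's code; the `segment` interface and the sample field names are assumptions:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def concept_gap(samples, segment):
    """For each paired sample, compare mask quality under the positive
    prompt vs. the misleading negative prompt. A concept-faithful model
    should score high on positives and low (or abstain) on negatives;
    high scores on both is the gap the paper reports."""
    pos_ious, neg_ious = [], []
    for s in samples:
        pos_ious.append(iou(segment(s["image"], s["pos_prompt"]), s["gt_mask"]))
        neg_ious.append(iou(segment(s["image"], s["neg_prompt"]), s["gt_mask"]))
    return float(np.mean(pos_ious)), float(np.mean(neg_ious))
```

A shortcut-driven model would return a high value in both positions; a faithful one only in the first.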

What carries the argument

The CAFE benchmark, which creates test cases through attribute-level counterfactual manipulation that preserves the ground-truth mask but changes semantic attributes to generate misleading prompts.

Load-bearing premise

The counterfactual manipulations of attributes preserve the target region and ground-truth mask while introducing misleading semantic cues without confounding changes to object identity or saliency.

What would settle it

If a given model instead produced substantially lower-quality masks specifically for the misleading prompts on the CAFE test set, that would indicate the claimed gap between localization quality and concept discrimination does not exist.

Figures

Figures reproduced from arXiv: 2605.09591 by Han Wang, Shuang Liang, Xihui Liu, Yuxian Li, Zeqing Wang.

Figure 1. CAFE focuses on attribute-level counterfactual evaluation …
Figure 2. Examples in our CAFE. Each sample contains a counterfactually edited target image, a ground-truth mask for the target region, a positive prompt that is semantically valid for the target, and a misleading negative prompt that is visually plausible but semantically invalid. The examples cover three attribute-level intervention types: Superficial Mimicry (SM), Ontological Conflict (OC), and Context Conflict (…
Figure 3. Overview of CAFE benchmark statistics. CAFE contains 2,146 paired counterfactual samples from three source datasets and spans three edit types: superficial mimicry (SM), context conflict (CC), and ontological conflict (OC). CAFE provides 656 positive prompts and 500 misleading prompts, forming 1,669 prompt pairs whose distribution is highly long-tailed, with 1,447 pair types appearing only once, indicating…
Figure 4. Overview of the CAFE dataset annotation pipeline. We draw image-annotation pairs from COCO, SA-Co, and LVIS. The images and annotations are first processed with affine transformations to fit the input size required by Gemini 3, and are then fed into Gemini 3 to generate corresponding editing instructions. Based on the generated instructions, we use nano-banana to perform image editing for all three counter…
Figure 5. Data annotation engine used for human quality inspection. Human annotators use the interface to check edit plausibility, mask alignment, and prompt validity during the multi-round filtering process. …
Figure 6. Illustration of category ambiguity and prompt disambiguation in ontological conflicts. Left: an object may validly belong to multiple categories, such as a toy deer belonging to both the toy and deer categories. Right: a cloud with an airplane-like shape visually resembles an airplane, but it remains a cloud rather than a real airplane. Therefore, CAFE uses precise negative prompts such as “real airplane” …
Figure 7. Additional examples from CAFE. Each sample consists of a counterfactually edited image, an inherited target mask, a semantically valid positive prompt, and a visually plausible but semantically invalid misleading prompt. The examples cover Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC), demonstrating the diversity of object categories, prompt pairs, and attribute-level count…
Figure 8. IoU-threshold sensitivity of AFPR and ACSR on CAFEval2026. Both metrics are computed for SAM3 at a fixed score threshold t = 0.5 and swept over τ ∈ [0.3, 0.9] with step 0.1. Curves are flat for τ ∈ [0.3, 0.7] across every subset, indicating that the model’s wrong predictions overlap the source target with high IoU (≳ 0.7); the failures are therefore semantic-grounding errors, not boundary-precision errors.…
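The τ-sweep described for Figure 8 can be approximated generically as the fraction of predictions whose IoU with the source target clears each threshold. The paper's exact AFPR/ACSR definitions are not reproduced here, so this is only a sketch of the sweep mechanics:

```python
import numpy as np

def sweep_overlap_rate(ious, taus=None):
    """Fraction of predictions whose IoU with the source target meets
    each threshold tau. A flat curve over tau in [0.3, 0.7] means the
    wrong masks overlap the target tightly, i.e. the failure is semantic
    rather than boundary precision. (Generic sketch; AFPR/ACSR may differ.)"""
    if taus is None:
        taus = np.arange(0.3, 0.91, 0.1)  # tau in [0.3, 0.9], step 0.1
    ious = np.asarray(ious, dtype=float)
    return {round(float(t), 1): float((ious >= t).mean()) for t in taus}
```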
Original abstract

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. Our CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on our CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. Our CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces CAFE, a benchmark of 2,146 paired samples for testing concept-faithful segmentation in promptable models such as SAM3. Samples are created via attribute-level counterfactual manipulations (Superficial Mimicry, Context Conflict, Ontological Conflict) that preserve the target region and ground-truth mask while modifying surface appearance, context, or material to generate misleading negative prompts. Experiments demonstrate that models frequently produce accurate masks even for negative prompts, indicating a systematic gap between localization quality and semantic grounding.

Significance. If the counterfactuals are shown to be valid, CAFE offers a controlled diagnostic for distinguishing shortcut-driven mask prediction from true concept understanding in segmentation models. This addresses a gap in existing benchmarks focused only on mask accuracy or presence, and could inform development of more semantically grounded promptable models. The scale and categorization into three conflict types are strengths for systematic evaluation.

major comments (2)
  1. [CAFE Benchmark Construction] The central claim—that strong mask prediction does not imply faithful semantic grounding—depends on the counterfactual edits preserving the exact target region and ground-truth mask while introducing only the intended misleading semantic cues without confounding changes to visual saliency, object identity, or low-level features (e.g., texture gradients or boundary contrast). The manuscript describes the construction of 2,146 pairs across SM/CC/OC but provides no details on the sample creation process, inter-annotator agreement, human verification of identity preservation, or quantitative checks (such as saliency metrics or feature similarity between positive/negative pairs). Without these, the observed gap cannot be cleanly attributed to lack of concept grounding rather than robustness to the edits. See the CAFE benchmark description and experimental setup.
  2. [Experiments] Table or figure reporting per-category results (e.g., mask IoU on positive vs. negative prompts) should include controls or ablations demonstrating that the attribute manipulations do not alter the core object in ways that affect mask prediction independently of the prompt. The current high-level outcome leaves open whether the gap reflects semantic failure or edit-induced feature invariance.
minor comments (3)
  1. [Abstract] The abstract states evaluation on 'various model types and sizes' but does not list the specific models (e.g., SAM variants, other promptable segmentors); this should be added for reproducibility.
  2. Consider adding a table or figure with example positive/negative prompt pairs and corresponding masks for each of the three categories (SM, CC, OC) to illustrate the manipulations.
  3. [Related Work] Ensure all citations to related work on counterfactual evaluation in vision and language grounding are included and discussed in the related work section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects of benchmark validity and experimental rigor. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [CAFE Benchmark Construction] The central claim—that strong mask prediction does not imply faithful semantic grounding—depends on the counterfactual edits preserving the exact target region and ground-truth mask while introducing only the intended misleading semantic cues without confounding changes to visual saliency, object identity, or low-level features (e.g., texture gradients or boundary contrast). The manuscript describes the construction of 2,146 pairs across SM/CC/OC but provides no details on the sample creation process, inter-annotator agreement, human verification of identity preservation, or quantitative checks (such as saliency metrics or feature similarity between positive/negative pairs). Without these, the observed gap cannot be cleanly attributed to lack of concept grounding rather than robustness to the edits. See the CAFE benchmark description and experimental setup.

    Authors: We agree that explicit documentation of the construction process is necessary to support the validity of the counterfactuals. In the revised manuscript, we will substantially expand the CAFE Benchmark Construction section with: (1) a detailed step-by-step account of how attribute-level manipulations were performed for each category (Superficial Mimicry, Context Conflict, Ontological Conflict) while preserving the target region and ground-truth mask; (2) the protocol for human verification, including that multiple annotators independently confirmed object identity and mask preservation; (3) inter-annotator agreement statistics (e.g., Cohen's kappa); and (4) quantitative controls such as saliency map similarity (using off-the-shelf saliency models) and feature-level comparisons (CLIP embedding cosine similarity and low-level texture/gradient metrics) between positive and negative pairs. These additions will allow readers to assess whether confounding changes were minimized. revision: yes

  2. Referee: [Experiments] Table or figure reporting per-category results (e.g., mask IoU on positive vs. negative prompts) should include controls or ablations demonstrating that the attribute manipulations do not alter the core object in ways that affect mask prediction independently of the prompt. The current high-level outcome leaves open whether the gap reflects semantic failure or edit-induced feature invariance.

    Authors: We accept this point and will improve the experimental presentation. The revised manuscript will include per-category breakdowns (SM, CC, OC) of mask IoU for both positive and negative prompts in an expanded Table 2 or new supplementary figure. We will also add an ablation study evaluating models on the counterfactual images using neutral prompts (e.g., 'the object' or 'segment the main region') to measure whether the edits affect mask quality independently of the semantic content of the prompt. Additionally, we will report results on a small control subset where only low-level visual features are altered without introducing semantic conflict. These controls will help isolate the contribution of concept grounding versus edit-induced invariance. revision: yes
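The two controls the rebuttal proposes (a quantitative edit-confound check and a neutral-prompt ablation) could be sketched as follows. The `segment` and `embed` interfaces and the sample field names are hypothetical, and the authors' actual metrics may differ:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def neutral_prompt_control(samples, segment, neutral_prompt="the main object"):
    """Run the model on the edited images with a semantically neutral
    prompt. High mean IoU would suggest the counterfactual edits do not
    degrade mask prediction by themselves, isolating prompt semantics
    as the variable under test."""
    return float(np.mean([iou(segment(s["image"], neutral_prompt), s["gt_mask"])
                          for s in samples]))

def edit_confound_check(pairs, embed):
    """Cosine similarity between embeddings of each (original, edited)
    image pair; low values flag edits that changed more than the intended
    attribute. `embed` is any image encoder (the rebuttal suggests CLIP)."""
    sims = []
    for original, edited in pairs:
        a = np.asarray(embed(original), dtype=float)
        b = np.asarray(embed(edited), dtype=float)
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return {"mean": float(np.mean(sims)), "min": float(np.min(sims))}
```

Reporting the neutral-prompt mean alongside the positive/negative breakdown would make it possible to separate semantic failure from edit-induced degradation.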

Circularity Check

0 steps flagged

No circularity: external benchmark evaluated on public models

full rationale

The paper defines CAFE via manual attribute-level counterfactual edits that preserve target regions and GT masks, then runs off-the-shelf models (SAM3 and variants) on the 2,146 pairs. No parameters are fitted to the test data, no predictions are made from fitted inputs, and no self-citations or uniqueness theorems are invoked to justify the central claim. The reported gap between localization quality and concept discrimination is a direct empirical observation on an independently constructed benchmark, not a reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical benchmark creation effort with no mathematical derivations, fitted parameters, or postulated entities; it relies on standard assumptions about image editing preserving object identity.

pith-pipeline@v0.9.0 · 5590 in / 1058 out tokens · 68993 ms · 2026-05-12T03:07:18.824803+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 3 internal anchors

  1. [1]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  2. [2]

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017

  3. [3]

    Masklab: Instance segmentation by refining object detection with semantic and direction features

    Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, and Hartwig Adam. Masklab: Instance segmentation by refining object detection with semantic and direction features. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4013–4022, 2018

  4. [4]

    Yolo-world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16901–16911, 2024

  5. [5]

    Scaling open-vocabulary image segmentation with image-level labels

    Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European conference on computer vision, pages 540–557. Springer, 2022

  6. [6]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019

  7. [7]

    Simultaneous detection and segmentation

    Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In European conference on computer vision, pages 297–312. Springer, 2014

  8. [8]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

  9. [9]

    Learning the difference that makes a difference with counterfactually-augmented data

    Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations

  10. [10]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  11. [11]

    Panoptic segmentation

    Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9404–9413, 2019

  13. [13]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  14. [14]

    Counterfactual fairness

    Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. Advances in neural information processing systems, 30, 2017

  15. [15]

    Naturalbench: Evaluating vision-language models on natural adversarial samples

    Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, and Deva Ramanan. Naturalbench: Evaluating vision-language models on natural adversarial samples. Advances in Neural Information Processing Systems, 37:17044–17068, 2024

  16. [16]

    Hallusegbench: Counterfactual visual reasoning for segmentation hallucination evaluation

    Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A Nguyen, and Ismini Lourentzou. Hallusegbench: Counterfactual visual reasoning for segmentation hallucination evaluation. arXiv e-prints, pages arXiv–2506, 2025

  17. [17]

    Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

    Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A Nguyen, Yifan Shen, Tianjiao Yu, and Ismini Lourentzou. Counterfactual segmentation reasoning: Diagnosing and mitigating pixel-grounding hallucination. arXiv preprint arXiv:2506.21546, 2025

  18. [18]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  19. [19]

    Gres: Generalized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023

  20. [20]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024

  21. [21]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

  22. [22]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  23. [23]

    One-shot instance segmentation

    Claudio Michaelis, Ivan Ustyuzhaninov, Matthias Bethge, and Alexander S Ecker. One-shot instance segmentation. arXiv preprint arXiv:1811.11507, 2018

  24. [24]

    Scaling open-vocabulary object detection

    Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems, 36:72983–73007, 2023

  25. [25]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In The Thirteenth International Conference on Learning Representations

  26. [26]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024

  27. [27]

    Is this generated person existed in real-world? Fine-grained detecting and calibrating abnormal human-body

    Zeqing Wang, Qingyang Ma, Wentao Wan, Haojie Li, Keze Wang, and Yonghong Tian. Is this generated person existed in real-world? Fine-grained detecting and calibrating abnormal human-body. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21226–21237, 2025

  28. [28]

    Phydetex: Detecting and explaining the physical plausibility of t2v models

    Zeqing Wang, Keze Wang, and Lei Zhang. Phydetex: Detecting and explaining the physical plausibility of t2v models. arXiv preprint arXiv:2512.01843, 2025

  29. [29]

    Videoverse: How far is your t2v generator from a world model?

    Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, and Lei Zhang. Videoverse: How far is your t2v generator from a world model? arXiv preprint arXiv:2510.08398, 2025

  30. [30]

    Timecausality: Evaluating the causal ability in time dimension for vision language models

    Zeqing Wang, Shiyuan Zhang, Chengpei Tang, and Keze Wang. Timecausality: Evaluating the causal ability in time dimension for vision language models. arXiv preprint arXiv:2505.15435, 2025

  31. [31]

    Towards top-down reasoning: An explainable multi-agent approach for visual question answering

    Zeqing Wang, Wentao Wan, Qiqing Lao, Runmeng Chen, Minjie Lang, Xiao Wang, Feng Gao, Keze Wang, and Liang Lin. Towards top-down reasoning: An explainable multi-agent approach for visual question answering. IEEE Transactions on Multimedia, 2026

  32. [32]

    Tiif-bench: How does your t2i model follow your instructions?

    Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions? arXiv preprint arXiv:2506.02161, 2025

  33. [33]

    Openworldsam: Extending sam2 for universal image segmentation with language prompts

    Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, and Priyadarshini Panda. Openworldsam: Extending sam2 for universal image segmentation with language prompts. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  34. [34]

    Segformer: Simple and efficient design for semantic segmentation with transformers

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021

  35. [35]

    Open-vocabulary panoptic segmentation with text-to-image diffusion models

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2955–2966, 2023

  36. [36]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European conference on computer vision, pages 69–85. Springer, 2016

  37. [37]

    A simple framework for open-vocabulary segmentation and detection

    Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1020–1031, 2023

  38. [38]

    Semantic understanding of scenes through the ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3):302–321, 2019

  39. [39]

    If there are multiple instances of the target object class in the image, read the query carefully to determine whether it applies to all instances or just one, and ground accordingly

  40. [40]

    a giraffe with its head up

    Identify the actual target object the user is asking you to ground. Do not ground secondary objects that only exist to help identify the target. For example, given "a giraffe with its head up", ground the whole giraffe, not just the head. Given "a person holding a blender with their left hand", ground the person, not the blender or hand

  41. [41]

    a man carrying a young girl

    Do not include masks for objects mentioned only for identification purposes. For example, given "a man carrying a young girl", ground only the man

  42. [42]

    something that shows the man is playing golf

    Sometimes the target is not directly named but clearly referenced. For example, given "something that shows the man is playing golf" and an image of a man holding a golf club, ground the golf club

  43. [43]

    Do not give up and callreport_no_mask due to small technicalities

    Carefully examine all details in the image and reason step by step. Do not give up and callreport_no_mask due to small technicalities. Only callreport_no_mask if there are clear, direct contradictions between the query and the image content

  44. [44]

    text_prompt

    If the query contains typos, grammatical errors, or irrelevant information, reason about the user’s intent based on the image content rather than following the query literally. Available Tools You must call exactly one tool per turn. Enclose the tool call in<tool>...</tool>tags. segment_phrase Use SAM3 to segment all instances of a simple noun phrase in t...

  45. [45]

    brown dog

    Use simple, direct noun phrases. You may include visual adjectives like color (e.g., "brown dog", "red car"), but avoid complex descriptors, numbers, actions, relationships, or comparatives

  46. [46]

    Use the object category instead (e.g., "sign" instead of the text on the sign)

    Do not try to ground text, letters, or numbers written on objects. Use the object category instead (e.g., "sign" instead of the text on the sign)

  47. [47]

    elementary school teacher

    If a phrase produces no masks or incomplete results, try a more general noun phrase. For example, if "elementary school teacher" returns nothing, try "person"

  48. [48]

    vase" instead of

    Avoid identifying concepts through actions or relationships. Use "vase" instead of "the bigger vase", "dog" instead of "the dog lying down"

  49. [49]

    Be creative with synonyms and visual common sense

    If results are not what you expected, try a differenttext_prompt. Be creative with synonyms and visual common sense

  50. [50]

    sundial" fails, try

    For niche objects that produce no masks, try grounding a more general category. For example, if "sundial" fails, try "statue"

  51. [51]

    Do not make it long

    Keep yourtext_promptconcise. Do not make it long

  52. [52]

    Never use the exact sametext_promptmore than once

  53. [53]

    person",

    When grounding a person, use general phrases like "person", "man", "girl" that refer to the whole person. Do not ground identifying parts or attributes (e.g., do not use "white hat" to find a guy with a white hat)

  54. [54]

    birthday greeting

    If a previoustext_prompt did not work, think of a new, creative phrase. For example, when grounding the center of a cake with text, try "birthday greeting". 25

  55. [55]

    adult person

    Always callsegment_phrase with atext_prompt that represents the entire grounding target. Do not use subparts (e.g., use "adult person" not "adult hand")

56. If the query refers to one specific instance among several, use the singular category name and then use select_masks_and_return to pick the correct one.

57. Every call to segment_phrase generates a fresh set of masks. Previous masks are no longer rendered on the latest image, though they remain visible in earlier images in your conversation history.

58. Only ground objects that fully match the query. Ignore partial matches.

59. Do not propose a text_prompt that covers more area than the query asks for (e.g., do not use "jeans" when asked for broken areas of jeans).

60. Do not propose a text_prompt that covers less area than the query asks for (e.g., do not use "microphone" when asked for the person holding a microphone).

61. Try to propose a text_prompt that covers exactly the queried object(s), no more and no less.

62. Be creative in your text_prompt choices. Use synonyms and visual common sense. You have multiple turns, so take your time.

examine_masks: Zoom into specific mask regions for close-up inspection. Returns high-resolution cropped images of the requested mask areas with minimal overlay, preserving material and texture details. Use this when you need to verify f...

63. You may only call examine_masks after segment_phrase has produced masks. mask_indices must be a non-empty array of valid mask numbers (1 to N, where N is the number of masks in the most recent segment_phrase result); out-of-range indices will be ignored.
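The mask_indices constraint stated in the rule above can be sketched in a few lines. This is an illustrative check, not code from the paper; the function name is hypothetical.

```python
def filter_mask_indices(mask_indices, num_masks):
    """Validate mask_indices for an examine_masks-style call.

    The array must be non-empty; out-of-range 1-based indices
    are silently dropped rather than raising, per the rule.
    """
    if not mask_indices:
        raise ValueError("mask_indices must be a non-empty array")
    return [i for i in mask_indices if 1 <= i <= num_masks]

print(filter_mask_indices([1, 5, 2], 3))  # [1, 2]
```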

64. Use this tool when you need to inspect material, texture, or fine details to determine whether the mask region truly matches the queried concept.

65. The returned zoom-in images do not have mask number labels to avoid occluding details. The images are returned in the order you requested, with a text description indicating which mask each image corresponds to.

66. You do not need to examine every mask. Only examine the ones where you are uncertain about the concept match.

select_masks_and_return: Select a subset of (or all) masks from the most recent segment_phrase result as your final answer. This ends the conversation. Parameters: {"final_answer_masks": [1, 2]}

Rules for select_masks_and_return:

67. Only call this when you are confident the selected masks correctly cover the queried concept.

68. Mask numbers refer to the most recent segment_phrase result image. Do not reference masks from earlier calls.

69. The integers in final_answer_masks must be within range 1 to N (number of masks in the most recent image), with no duplicates.
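The final_answer_masks constraint above amounts to a range check plus a duplicate check. A minimal sketch, with a hypothetical helper name not taken from the paper:

```python
def final_answer_masks_valid(final_answer_masks, num_masks):
    """Return True iff every entry is an integer in 1..num_masks
    and no mask number appears twice."""
    in_range = all(isinstance(i, int) and 1 <= i <= num_masks
                   for i in final_answer_masks)
    no_duplicates = len(set(final_answer_masks)) == len(final_answer_masks)
    return in_range and no_duplicates

print(final_answer_masks_valid([1, 2], 3))  # True
print(final_answer_masks_valid([1, 1], 3))  # False (duplicate)
print(final_answer_masks_valid([0, 2], 3))  # False (out of range)
```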

70. The selected masks should accurately capture the target object(s) and only the target object(s).

71. Before calling this tool, verify that each selected mask matches the original user query (not just the intermediate text_prompt you used for segment_phrase).

72. If the query involves colors, double-check against the original image, since mask overlays change object colors.

73. If the query involves relative positions, explicitly reason about each mask's spatial position before selecting.

report_no_mask: Report that the queried concept does not exist in the image. This ends the conversation. Parameters: {} (empty object)

Rules for report_no_mask:

74. Only call this when you have carefully examined the image and determined that no object matches the queried concept.

75. If at any point in your reasoning you identified a matching target, you must not call report_no_mask. Use select_masks_and_return instead.

76. Before calling this tool, re-examine the original image and explicitly restate why no object matches the query.

77. Be thorough: if the query is slightly inaccurate but a related object exists, ground that object instead of reporting no mask.

78. Do not call report_no_mask due to minor discrepancies. Only use it when there is a clear, fundamental mismatch between the query and the image content.

Response Format: Each turn, first provide your reasoning inside <think> tags, then call exactly one tool inside <tool> tags. Do not call multiple tools in one turn. Your response will be programmatically parse...
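The response format above can be sketched as a small parser: extract the <think> reasoning, require exactly one <tool> call, and decode its payload. The excerpt indicates the payload names the tool via a "name" field, but the exact JSON shape (including the "parameters" key used here) is an assumption, and the regex-based parsing is only illustrative.

```python
import json
import re

def parse_turn(response: str):
    """Parse one agent turn of the form:
    <think>reasoning</think><tool>{"name": ..., ...}</tool>

    Raises if the turn does not contain exactly one <tool> call,
    mirroring the one-tool-per-turn rule. Payload shape is assumed.
    """
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    tools = re.findall(r"<tool>(.*?)</tool>", response, re.DOTALL)
    if len(tools) != 1:
        raise ValueError("expected exactly one <tool> call per turn")
    reasoning = think.group(1).strip() if think else ""
    return reasoning, json.loads(tools[0])

reasoning, call = parse_turn(
    "<think>Mask 2 matches the query.</think>"
    '<tool>{"name": "select_masks_and_return", '
    '"parameters": {"final_answer_masks": [2]}}</tool>'
)
print(call["name"])  # select_masks_and_return
```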