GEAR-Seg: A Grounded Explainable Agent for Reasoning Segmentation and Data Engine

Wen Li; Yanan Wang; Yibin Ying; Zhenghao Fei

arxiv: 2607.00544 · v1 · pith:SFTMJJTYnew · submitted 2026-07-01 · 💻 cs.CV

Pith reviewed 2026-07-02 14:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords: reasoning segmentation, explainable AI, LLM deduction, data engine, zero-shot inference, referring segmentation, multimodal reasoning, synthetic data

The pith

GEAR-Seg decouples segmentation, text description, and LLM deduction to turn implicit reasoning into an explicit logic chain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning segmentation localizes objects from complex implicit queries, but end-to-end models hide the steps inside an opaque box that limits both understanding and scaling. GEAR-Seg splits the task into three separate stages: class-agnostic region finding, conversion of those regions into dense attribute text, and LLM deduction over the text. The separation produces a trackable chain of steps instead of a black-box answer. The same pipeline doubles as a data engine that automatically labels over 38,000 images with 656,000 QA-mask pairs, creating the GEAR-131K benchmark. Distilled lightweight models trained only on this synthetic data reach performance close to models trained on expensive human labels.

Core claim

By decoupling class-agnostic segmentation, semantic description, and Large Language Model deduction, GEAR-Seg converts implicit visual reasoning into an explicit, trackable logic chain. As a zero-shot framework, it achieves competitive performance on reasoning and referring segmentation benchmarks. The same architecture functions as a scalable data engine that produces the GEAR-131K benchmark, containing more than 38k images and 656k QA-mask pairs organized under a manipulation-oriented taxonomy. Distillation experiments show that models trained solely on the automatically generated data approach the accuracy of models trained on human-annotated data.

What carries the argument

Three-stage decoupled pipeline that first extracts class-agnostic regions, then renders each region as attribute-rich text, then applies LLM deduction on the resulting text descriptions.
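The control flow of the three stages can be sketched as follows. This is a minimal illustration and not the authors' implementation: `Region`, `run_pipeline`, and the three callables are hypothetical names, and the toy functions stand in for the segmenter, the region captioner, and the LLM so that the sketch runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Region:
    index: int          # position of the mask in the proposal list
    mask: object        # placeholder for a binary mask (e.g. an array)
    caption: str = ""   # filled in by the description stage


def run_pipeline(
    image: object,
    query: str,
    propose_masks: Callable[[object], Sequence[object]],
    describe_region: Callable[[object, object], str],
    deduce: Callable[[str, List[str]], List[int]],
) -> List[Region]:
    """Run the three decoupled stages and return the selected regions."""
    # Stage 1: class-agnostic region proposals.
    regions = [Region(i, m) for i, m in enumerate(propose_masks(image))]
    # Stage 2: render every region as attribute-rich text.
    for r in regions:
        r.caption = describe_region(image, r.mask)
    # Stage 3: text-only deduction; the deducer sees captions, never pixels.
    chosen = deduce(query, [f"[{r.index}] {r.caption}" for r in regions])
    return [regions[i] for i in chosen if 0 <= i < len(regions)]


# Toy stand-ins (assumptions for illustration only).
def toy_propose(image):
    return ["mask_a", "mask_b", "mask_c"]


def toy_describe(image, mask):
    return {
        "mask_a": "a ripe red strawberry",
        "mask_b": "a green unripe strawberry",
        "mask_c": "a brown dried leaf",
    }[mask]


def toy_deduce(query, lines):
    # Keyword match standing in for an LLM call.
    return [i for i, line in enumerate(lines) if "ripe red" in line]


selected = run_pipeline(None, "Which fruit is ready to pick?",
                        toy_propose, toy_describe, toy_deduce)
print([r.index for r in selected])  # [0]
```

Because each stage is passed in as a callable, the intermediate captions and the indexed lines handed to the deducer can be logged and inspected, which is the sense in which the chain is trackable.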

Load-bearing premise

Converting visual regions into dense attribute-rich text descriptions preserves all information needed for accurate LLM deduction on complex implicit queries without introducing critical omissions or hallucinations.

What would settle it

On a held-out set of complex implicit queries, measure whether GEAR-Seg's LLM deductions systematically miss targets that a direct end-to-end model correctly segments; a large gap would indicate information loss in the text step.
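One way to run this check is sketched below. It assumes masks are represented as sets of pixel coordinates and that a prediction counts as a hit when its IoU with the ground truth reaches a threshold of 0.5; both choices, and the function names, are illustrative assumptions rather than the paper's protocol.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def iou(pred: Set[Pixel], gt: Set[Pixel]) -> float:
    """Intersection over union of two masks given as pixel sets."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def information_loss_gap(
    pipeline_preds: List[Set[Pixel]],
    end_to_end_preds: List[Set[Pixel]],
    ground_truths: List[Set[Pixel]],
    threshold: float = 0.5,
) -> Tuple[float, float]:
    """Return (share of queries only the end-to-end model hits,
    share of queries only the decoupled pipeline hits)."""
    n = len(ground_truths)
    if n == 0 or len(pipeline_preds) != n or len(end_to_end_preds) != n:
        raise ValueError("need equally sized, non-empty prediction lists")
    only_e2e = only_pipeline = 0
    for p, e, g in zip(pipeline_preds, end_to_end_preds, ground_truths):
        p_hit = iou(p, g) >= threshold
        e_hit = iou(e, g) >= threshold
        if e_hit and not p_hit:
            only_e2e += 1
        elif p_hit and not e_hit:
            only_pipeline += 1
    return only_e2e / n, only_pipeline / n


# Toy data: the pipeline misses the second query, the end-to-end model misses none.
gt = [{(0, 0), (0, 1)}, {(1, 1)}, {(2, 2), (2, 3)}]
pipe = [{(0, 0), (0, 1)}, {(5, 5)}, {(2, 2), (2, 3)}]
e2e = [{(0, 0), (0, 1)}, {(1, 1)}, {(2, 2), (2, 3)}]

missed_by_pipeline, missed_by_e2e = information_loss_gap(pipe, e2e, gt)
print(round(missed_by_pipeline, 3), missed_by_e2e)  # 0.333 0.0
```

A first share that is consistently much larger than the second would point to information lost in the text step; comparable shares would suggest the two approaches simply fail on different queries.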

Figures

Figures reproduced from arXiv: 2607.00544 by Wen Li, Yanan Wang, Yibin Ying, Zhenghao Fei.

**Figure 1.** Overview of GEAR-Seg’s multifaceted capabilities. Serving as both a zero-shot inference agent and a scalable data engine, it explicitly translates pixels into text to seamlessly support complex reasoning segmentation, dense referring segmentation, and fine-grained attribute grounding in long-tail domains.

**Figure 2.** Overview of the GEAR-Seg framework. The agent explicitly decouples the reasoning segmentation task into class-agnostic perception (SAM 2), dense semantic description (DAM), and logic-driven abstraction (LLM), serving as both a zero-shot inference engine and a scalable data generator.

**Figure 3.** Overview of the GEAR-Seg data generation pipeline and operational modes.

**Figure 4.** Detailed statistics of the GEAR-131K benchmark. (a) Image distribution across source datasets. (b) Proportion of the five specialized reasoning categories. (c) Word cloud illustrating the semantic diversity of the targeted entities. (d) Comprehensive feature comparison against existing reasoning segmentation datasets.

**Figure 5.** Qualitative results of the GEAR-Seg agent. (a) Complex reasoning segmentation on ReasonSeg and LLM-Seg40k. (b) Open-world auto-label extraction across diverse agricultural scenes, showcasing the zero-shot discovery of long-tail categories. (c) Fine-grained maturity grading, demonstrating precise attribute-based grounding under severe occlusion.

**Figure 6.** Accuracy evaluation and typical failure modes of the GEAR-Seg agent, illustrating cascading errors in complex reasoning and attribute hallucination.

**Figure 7.** Top: Representative examples of the 5-fold linguistic expansion in the GEAR-131K dataset. Bottom: Additional dataset visualizations of GEAR-131K.

**Figure 8.** Examples of the auto-label extraction capability of GEAR-Seg across diverse MegaFruits datasets. By prompting the model to analyze the scene context, it automatically extracts a set of semantic categories and assigns fine-grained labels to each detected instance, including long-tail objects often missed by human annotators.

Original abstract

Reasoning segmentation requires localizing targets based on complex, implicit queries. Current end-to-end models typically entangle perception and deduction into an opaque black box, severely limiting interpretability and scalability. To address this, we propose GEAR-Seg (Grounded Explainable Agent for Reasoning Segmentation), an explicitly decoupled agent that shifts the paradigm by translating visual pixels into dense, attribute-rich text. By decoupling class-agnostic segmentation, semantic description, and Large Language Model (LLM) deduction, GEAR-Seg transforms implicit reasoning into an explicit, trackable logic chain. As a zero-shot inference framework, it achieves highly competitive performance across diverse reasoning and fine-grained referring segmentation benchmarks. Furthermore, GEAR-Seg inherently functions as a highly scalable data engine. Utilizing this engine, we construct GEAR-131K, a massive benchmark (over 38k images, 656k QA-mask pairs) introducing a multifaceted taxonomy tailored for complex real-world manipulation-oriented reasoning. Finally, distillation experiments demonstrate that lightweight models supervised exclusively by our automated pipeline closely match the upper-bound performance of costly human-annotated baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GEAR-Seg's decoupling makes the reasoning chain explicit and produces a useful large dataset, but the text-only LLM step still needs checks on whether it drops spatial details.

The main point is that GEAR-Seg splits reasoning segmentation into class-agnostic masks, per-region text captions, and then LLM deduction on the concatenated text. This turns an opaque process into something you can inspect step by step, and the same pipeline doubles as a data engine to generate GEAR-131K.

The new pieces are the 38k-image benchmark with its manipulation-focused taxonomy and the 656k QA-mask pairs. The distillation results also stand out: lightweight models trained only on the auto-generated labels come close to human-annotated baselines. That part shows the pipeline can scale data without constant human labeling, which matters for robotics and scene tasks.
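The paper's description of the data engine says that each image yields a base query, an explicit step-by-step logic chain, and the corresponding mask indices. A record of that shape could be stored as one JSON line per QA-mask pair, as in the sketch below. The class name, field names, and example values are hypothetical; the released benchmark's actual schema is not given in the material available here.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class QAMaskRecord:
    image_id: str            # hypothetical identifier
    category: str            # one of the reasoning categories (names not given here)
    query: str               # the implicit question posed about the image
    logic_chain: List[str]   # explicit deduction steps, in order
    mask_indices: List[int]  # indices into the image's region proposals


record = QAMaskRecord(
    image_id="example_0001",
    category="unspecified",
    query="Which fruit should be picked first?",
    logic_chain=[
        "Region 0 is described as a ripe red strawberry.",
        "Region 1 is described as green and unripe.",
        "Ripe fruit is picked first, so the answer is region 0.",
    ],
    mask_indices=[0],
)

# Round-trip through JSON to confirm the record is serializable as-is.
line = json.dumps(asdict(record))
restored = QAMaskRecord(**json.loads(line))
print(restored == record)  # True
```

Keeping the logic chain next to the mask indices lets a distilled model be supervised on the answer alone or on the answer plus its rationale.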

The approach is straightforward and the dataset size is a clear plus. Prior modular vision-language work exists, but the explicit three-stage framing plus the scale of the released benchmark give it a practical edge.

The soft spot sits in the captioning step. Feeding only dense attribute text to the LLM assumes the vision-language model captures every relation and spatial cue the query needs. The stress-test concern is real here: if relative positions, textures, or occlusions get summarized away, the LLM can follow a clean logic chain to the wrong mask. The abstract reports competitive zero-shot numbers, but without ablations on caption quality or failure cases on spatial implicit queries, it is hard to know how often the assumption holds.
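A simple diagnostic for this failure mode is to flag regions whose captions are textually identical, since a text-only deducer cannot tell them apart when the query depends on position. The sketch below is an illustrative check and not part of the paper; the function name and the normalization (lowercasing and collapsing whitespace) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def ambiguous_caption_groups(captions: List[str]) -> Dict[str, List[int]]:
    """Group region indices by normalized caption and keep groups larger than one."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, caption in enumerate(captions):
        key = " ".join(caption.lower().split())
        groups[key].append(idx)
    return {key: idxs for key, idxs in groups.items() if len(idxs) > 1}


captions = [
    "A red mug with a handle",
    "a red mug  with a handle",
    "A blue plate",
]
print(ambiguous_caption_groups(captions))
# {'a red mug with a handle': [0, 1]}
```

For a query such as "the mug on the left", regions 0 and 1 above are indistinguishable from text alone, so the rate of such collisions on a benchmark gives a rough lower bound on how often spatial cues are missing from the captions.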

This paper is for groups working on explainable VLM pipelines or automated dataset creation for segmentation. Readers who need large manipulation-oriented data will get concrete value from the benchmark construction.

It deserves a serious referee because the dataset and distillation results are substantive even if the information-loss question needs more evidence.

Referee Report

2 major / 1 minor

Summary. The paper proposes GEAR-Seg, a decoupled agent framework for reasoning segmentation that separates class-agnostic mask generation, VLM-based dense semantic description of regions, and LLM-based deduction to produce an explicit, trackable reasoning chain. It claims competitive zero-shot performance on reasoning and fine-grained referring segmentation benchmarks, positions the method as a scalable data engine used to create the GEAR-131K benchmark (38k+ images, 656k QA-mask pairs with a manipulation-oriented taxonomy), and reports that lightweight models distilled from the automated pipeline closely match human-annotated upper bounds.

Significance. If the empirical claims hold with supporting evidence, the work would offer a concrete advance in interpretability for complex vision-language reasoning tasks and a practical route to large-scale automated dataset creation, reducing reliance on costly human annotations while maintaining performance.

major comments (2)

[Abstract] The central claims of 'highly competitive performance across diverse reasoning and fine-grained referring segmentation benchmarks' and that 'lightweight models supervised exclusively by our automated pipeline closely match the upper-bound performance of costly human-annotated baselines' are stated without any quantitative tables, benchmark scores, error bars, or ablation results; this absence prevents evaluation of the empirical soundness of the zero-shot and distillation results.

[Section 3, method] The pipeline generates per-region captions via a vision-language model and concatenates them as input to the LLM for deduction; this step assumes the text descriptions preserve all spatial relations, occlusion details, texture gradients, and context needed for accurate deduction on implicit manipulation queries, yet no validation, failure-case analysis, or comparison against direct visual input is provided to test this assumption, which is load-bearing for the 'explicit, trackable logic chain' claim.

minor comments (1)

[Abstract] The dataset is described as 'GEAR-131K' with 'over 38k images, 656k QA-mask pairs'; the naming convention and exact scope of the 131K figure should be clarified relative to the reported counts.
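One possible reading of the name, offered as an unconfirmed assumption: the Figure 7 caption mentions a 5-fold linguistic expansion, and dividing the reported pair count by five lands close to 131K. The arithmetic is below; whether the authors intend this interpretation is not stated in the material available here.

```python
# Reported in the abstract: about 656k QA-mask pairs.
qa_mask_pairs = 656_000
# Mentioned in the Figure 7 caption: 5-fold linguistic expansion.
expansion_factor = 5

# Hypothetical number of base queries before expansion.
base_queries = qa_mask_pairs / expansion_factor
print(f"{base_queries:,.0f}")  # 131,200
```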

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quantitative support in the abstract and validation of the text-based reasoning assumption. We address both points below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract] The central claims of 'highly competitive performance across diverse reasoning and fine-grained referring segmentation benchmarks' and that 'lightweight models supervised exclusively by our automated pipeline closely match the upper-bound performance of costly human-annotated baselines' are stated without any quantitative tables, benchmark scores, error bars, or ablation results; this absence prevents evaluation of the empirical soundness of the zero-shot and distillation results.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version, we will add specific benchmark scores (e.g., mIoU on reasoning segmentation tasks and comparison to baselines) with references to the main tables, while keeping the abstract concise. This directly addresses the concern about empirical soundness. revision: yes

Referee: [Section 3, method] The pipeline generates per-region captions via a vision-language model and concatenates them as input to the LLM for deduction; this step assumes the text descriptions preserve all spatial relations, occlusion details, texture gradients, and context needed for accurate deduction on implicit manipulation queries, yet no validation, failure-case analysis, or comparison against direct visual input is provided to test this assumption, which is load-bearing for the 'explicit, trackable logic chain' claim.

Authors: The referee correctly notes that the captioning step is central to the explicit chain. The current manuscript does not include a dedicated validation study or direct comparison to visual-input baselines. We will add a new subsection with quantitative comparison of LLM deduction accuracy using VLM captions versus direct image input, plus failure-case analysis on spatial/occlusion details. This will either support the assumption or clarify its limitations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and self-contained against external benchmarks

The paper describes GEAR-Seg as a zero-shot decoupled framework (class-agnostic masks + VLM captions + LLM deduction) evaluated on external reasoning/referring segmentation benchmarks, with the data engine used to generate new GEAR-131K data and distillation results compared to human baselines. No equations, fitted parameters, or predictions are presented that reduce reported performance or claims to the inputs by construction. No self-citation load-bearing steps or ansatz smuggling appear in the provided text. The central claims rest on empirical outcomes and the explicit decoupling architecture rather than self-referential definitions or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can reliably perform deduction from generated text descriptions; no free parameters or invented physical entities are mentioned. Full paper would be needed to audit any additional modeling choices.

axioms (1)

Domain assumption: Large language models can perform accurate deduction on complex implicit queries when given dense attribute-rich text descriptions of image regions.
Central to the claim that the decoupled pipeline preserves reasoning capability.

invented entities (1)

GEAR-Seg agent (no independent evidence)
purpose: Explicitly decoupled pipeline for grounded explainable reasoning segmentation
New system architecture proposed by the authors.

Reference graph

Works this paper leans on

38 extracted references · 29 canonical work pages · 3 internal anchors

[1] Acharya, D.B., Kuppan, K., Divya, B.: Agentic AI: Autonomous intelligence for complex goals—a comprehensive survey. IEEE Access 13, 18912–18936 (2025). https://doi.org/10.1109/ACCESS.2025.3532853

[2] Chen, R., Li, C., Wu, Q., Zhong, Y.Z., Han, P., Li, W., Wei, Y., Zhao, Y.: LLM-Seg: Bridging image segmentation and large language model reasoning. In: IEEE Conf. Comput. Vis. Pattern Recog. Worksh. pp. 1765–1774 (2024). https://doi.org/10.1109/CVPRW63382.2024.00183

[3] Chen, X., Hu, J., Chen, Z., Li, Y., Darrell, T., Yu, F., Gao, J.: LISA: Reasoning segmentation via large language models. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9579–9589 (2024). https://doi.org/10.1109/CVPR52733.2024.00915

[4] Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation. In: Eur. Conf. Comput. Vis. pp. 323–340 (2024). https://doi.org/10.1007/978-3-031-73004-7_19

[5] Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: YOLO-World: Real-time open-vocabulary object detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 16901–16911 (2024). https://doi.org/10.1109/CVPR52733.2024.01599

[6] Ding, H., Liu, C., Wang, S., Jiang, X.: Vision-language transformer and query generation for referring segmentation. In: Int. Conf. Comput. Vis. pp. 16301–16310 (2021). https://doi.org/10.1109/ICCV48922.2021.01601

[7] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4

[8] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024). https://doi.org/10.48550/arXiv.2407.21783

[9] Gupta, A., Dollár, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5356–5364 (2019). https://doi.org/10.1109/CVPR.2019.00550

[10] Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: Eur. Conf. Comput. Vis. pp. 108–124 (2016). https://doi.org/10.1007/978-3-319-46448-0_7

[11] Jang, D., Cho, Y., Lee, S., Kim, T., Kim, D.: MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. In: Int. Conf. Learn. Represent. (2025). https://openreview.net/forum?id=mzL19kKE3r

[12] Kirillov, A., Girshick, R.M., Dollár, P., Mahajan, D.R., et al.: Segment anything. In: Int. Conf. Comput. Vis. pp. 4015–4026 (2023). https://doi.org/10.1109/ICCV51070.2023.00371

[13] Kozlov, A., Lazarevich, I., Shamporov, V., Lyalyushkin, N., Gorbachev, Y.: Neural network compression framework for fast model inference. In: Lecture Notes in Networks and Systems. vol. 285, pp. 240–253 (2021). https://doi.org/10.1007/978-3-030-80129-8_17

[14] Li, Y., Chen, C., Dai, X., Chen, H.: Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10988–10997 (2020). https://doi.org/10.1109/CVPR42600.2020.01100

[15] Lian, L., Ding, Y., Ge, Y., Cui, Y., Yala, A., Darrell, T.: DAM: Describe anything model for detailed localized image and video captioning. In: Int. Conf. Comput. Vis. pp. 21766–21777 (2025)

[16] Liang, Y., Li, C., Zhang, D., Yang, Z., Wang, B., Mei, T.: CogAgent: A visual language model for GUI agents. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14281–14290 (2024). https://doi.org/10.1109/CVPR52733.2024.01354

[17] Liu, C., Ding, H., Jiang, X.: GRES: Generalized referring expression segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 23592–23601 (2023). https://doi.org/10.1109/CVPR52729.2023.02259

[18] Liu, Y., Zhang, J., Han, J., Yang, Y., Li, C., Gao, J.: LAVT: Language-aware vision transformer for referring image segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 18134–18144 (2022). https://doi.org/10.1109/CVPR52688.2022.01762

[19] Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., Houlsby, N.: Simple Open-Vocabulary object detection. In: Eur. Conf. Comput. Vis. pp. 728–755 (2022). https://doi.org/10.1007/978-3-031-20080-9_42

[20] Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P.: The Mapillary Vistas dataset for semantic understanding of street scenes. In: Int. Conf. Comput. Vis. pp. 5122–5130 (2017). https://doi.org/10.1109/ICCV.2017.534

[21] Pérez-Borrero, I., Marín-Santos, D., Gegúndez-Arias, M.E., Cortés-Ancos, E.: A fast and accurate deep learning method for strawberry instance segmentation. Comput. Electron. Agric. 178, 105736 (2020). https://doi.org/10.1016/j.compag.2020.105736

[22] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. In: Int. Conf. Learn. Represent. (2024). https://doi.org/10.48550/arXiv.2408.00714

[23] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. pp. 3982–3992 (2019). https://doi.org/10.18653/v1/D19-1410

[24] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024). https://doi.org/10.48550/arXiv.2401.14159

[25] Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: PixelLM: Pixel reasoning with large multimodal model. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 26374–26383 (2024). https://doi.org/10.1109/CVPR52733.2024.02491

[26] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized Intersection Over Union: A metric and a loss for bounding box regression. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 658–666 (2019). https://doi.org/10.1109/CVPR.2019.00075

[27] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Med. Image Comput. Comput.-Assist. Intervent. pp. 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28

[28] Sachdeva, N., Dhaliwal, M., Wu, C.J., McAuley, J.: Infinite Recommendation Networks: A data-centric approach. In: Adv. Neural Inform. Process. Syst. vol. 35, pp. 31292–31305 (2022)

[29] Varghese, R., S. M., S.: YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In: International Conference on Artificial Intelligence and Data Sciences. pp. 1–6 (2024). https://doi.org/10.1109/ADICS58448.2024.10533619

[30] Wang, Y., Fei, Z., Li, R., Ying, Y.: Learn from foundation model: Fruit detection model without manual annotation. Pattern Recognition 174, 112799 (2026). https://doi.org/10.1016/j.patcog.2025.112799

[31] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: CRIS: CLIP-driven referring image segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11676–11685 (2022). https://doi.org/10.1109/CVPR52688.2022.01139

[32] Yang, A., Li, A., Yang, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025). https://doi.org/10.48550/arXiv.2505.09388

[33] Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: MAttNet: Modular attention network for referring expression comprehension. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1307–1315 (2018)

[34] Zhang, L., et al.: AI agents vs. Agentic AI: A conceptual taxonomy, applications and challenges. Information Fusion 122, 103599 (2025). https://doi.org/10.1016/j.inffus.2025.103599

[35] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6877–6886 (2021). https://doi.org/10.1109/CVPR46437.2021.00681

[36] Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: Faster and better learning for bounding box regression. In: AAAI. pp. 12993–13000 (2020). https://doi.org/10.1609/aaai.v34i07.6999

[37] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5122–5130 (2017). https://doi.org/10.1109/CVPR.2017.544

[38] Zhu, L., Chen, T., Xu, Q., Liu, X., Ji, D., Wu, H., Soh, D.W., Liu, J.: Popen: Preference-based optimization and ensemble for LVLM-based reasoning segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

[1] [1]

, year = 2025, journal =

Acharya, D.B., Kuppan, K., Divya, B.: Agentic AI: Autonomous intelligence for complex goals—a comprehensive survey. IEEE Access13, 18912–18936 (2025). https://doi.org/10.1109/ACCESS.2025.3532853

work page doi:10.1109/access.2025.3532853 2025

[2] [2]

In: IEEE Conf

Chen, R., Li, C., Wu, Q., Zhong, Y.Z., Han, P., Li, W., Wei, Y., Zhao, Y.: LLM- Seg: Bridging image segmentation and large language model reasoning. In: IEEE Conf. Comput. Vis. Pattern Recog. Worksh. pp. 1765–1774 (2024).https://doi. org/10.1109/CVPRW63382.2024.00183

work page doi:10.1109/cvprw63382.2024.00183 2024

[3] [3]

Capsfusion: Rethinking image-text data at scale

Chen, X., Hu, J., Chen, Z., Li, Y., Darrell, T., Yu, F., Gao, J.: LISA: Reasoning segmentation via large language models. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9579–9589 (2024).https://doi.org/10.1109/CVPR52733.2024.00915

work page doi:10.1109/cvpr52733.2024.00915 2024

[4] [4]

Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: En- hance multi-modal large language model for referring expression segmentation. In: Eur. Conf. Comput. Vis. pp. 323–340 (2024).https://doi.org/10.1007/ 978-3-031-73004-7_19 16 Y. Wang et al

2024

[5] [5]

Capsfusion: Rethinking image-text data at scale

Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: YOLO-World: Real-time open-vocabulary object detection. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 16901–16911 (2024).https://doi.org/10.1109/CVPR52733.2024.01599

work page doi:10.1109/cvpr52733.2024.01599 2024

[6] [6]

Ding, H., Liu, C., Wang, S., Jiang, X.: Vision-language transformer and query generation for referring segmentation. In: Int. Conf. Comput. Vis. pp. 16301–16310 (2021).https://doi.org/10.1109/ICCV48922.2021.01601

work page doi:10.1109/iccv48922.2021.01601 2021

[7] [7]

Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis.88(2), 303– 338 (2010).https://doi.org/10.1007/s11263-009-0275-4

work page doi:10.1007/s11263-009-0275-4 2010

[8] [8]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Let- man, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of mod- els. arXiv preprint arXiv:2407.21783 (2024).https://doi.org/10.48550/arXiv. 2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024

[9] [9]

In: IEEE Conf

Gupta, A., Doll´ ar, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5356–5364 (2019). https://doi.org/10.1109/CVPR.2019.00550

work page doi:10.1109/cvpr.2019.00550 2019

[10] [10]

Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expres- sions. In: Eur. Conf. Comput. Vis. pp. 108–124 (2016).https://doi.org/10.1007/ 978-3-319-46448-0_7

2016

[11] [11]

Jang, D., Cho, Y., Lee, S., Kim, T., Kim, D.: MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. In: Int. Conf. Learn. Represent. (2025),https://openreview.net/forum?id=mzL19kKE3r

2025

[12] [12]

Kirillov, A., Girshick, R.M., Doll´ ar, P., Mahajan, D.R., et al.: Segment anything. In: Int. Conf. Comput. Vis. pp. 4015–4026 (2023).https://doi.org/10.1109/ ICCV51070.2023.00371

work page arXiv 2023

[13] [13]

In: Lecture Notes in Networks and Systems

Kozlov, A., Lazarevich, I., Shamporov, V., Lyalyushkin, N., Gorbachev, Y.: Neural network compression framework for fast model inference. In: Lecture Notes in Networks and Systems. vol. 285, pp. 240–253 (2021).https://doi.org/10.1007/ 978-3-030-80129-8_17

2021

[14] [14]

In: IEEE Conf

Li, Y., Chen, C., Dai, X., Chen, H.: Overcoming classifier imbalance for long- tail object detection with balanced group softmax. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 10988–10997 (2020).https://doi.org/10.1109/CVPR42600. 2020.01100

work page doi:10.1109/cvpr42600 2020

[15] Lian, L., Ding, Y., Ge, Y., Cui, Y., Yala, A., Darrell, T.: DAM: Describe anything model for detailed localized image and video captioning. In: Int. Conf. Comput. Vis. pp. 21766–21777 (2025)

[16] Liang, Y., Li, C., Zhang, D., Yang, Z., Wang, B., Mei, T.: CogAgent: A visual language model for GUI agents. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 14281–14290 (2024). https://doi.org/10.1109/CVPR52733.2024.01354

[17] Liu, C., Ding, H., Jiang, X.: GRES: Generalized referring expression segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 23592–23601 (2023). https://doi.org/10.1109/CVPR52729.2023.02259

[18] Liu, Y., Zhang, J., Han, J., Yang, Y., Li, C., Gao, J.: LAVT: Language-aware vision transformer for referring image segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 18134–18144 (2022). https://doi.org/10.1109/CVPR52688.2022.01762

[19] Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., Houlsby, N.: Simple Open-Vocabulary object detection. In: Eur. Conf. Comput. Vis. pp. 728–755 (2022). https://doi.org/10.1007/978-3-031-20080-9_42

[20] Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P.: The Mapillary Vistas dataset for semantic understanding of street scenes. In: Int. Conf. Comput. Vis. pp. 5122–5130 (2017). https://doi.org/10.1109/ICCV.2017.534

[21] Pérez-Borrero, I., Marín-Santos, D., Gegúndez-Arias, M.E., Cortés-Ancos, E.: A fast and accurate deep learning method for strawberry instance segmentation. Comput. Electron. Agric. 178, 105736 (2020). https://doi.org/10.1016/j.compag.2020.105736

[22] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. In: Int. Conf. Learn. Represent. (2024). https://doi.org/10.48550/arXiv.2408.00714

[23] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. pp. 3982–3992 (2019). https://doi.org/10.18653/v1/D19-1410

[24] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded SAM: Assembling open-world models for diverse visual tasks. In: arXiv preprint arXiv:2401.14159 (2024). https://doi.org/10.48550/arXiv.2401.14159

[25] Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: PixelLM: Pixel reasoning with large multimodal model. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 26374–26383 (2024). https://doi.org/10.1109/CVPR52733.2024.02491

[26] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized Intersection Over Union: A metric and a loss for bounding box regression. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 658–666 (2019). https://doi.org/10.1109/CVPR.2019.00075

[27] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Med. Image Comput. Comput.-Assist. Intervent. pp. 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28

[28] Sachdeva, N., Dhaliwal, M., Wu, C.J., McAuley, J.: Infinite Recommendation Networks: A data-centric approach. In: Adv. Neural Inform. Process. Syst. vol. 35, pp. 31292–31305 (2022)

[29] Varghese, R., S. M., S.: YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In: International Conference on Artificial Intelligence and Data Sciences. pp. 1–6 (2024). https://doi.org/10.1109/ADICS58448.2024.10533619

[30] Wang, Y., Fei, Z., Li, R., Ying, Y.: Learn from foundation model: Fruit detection model without manual annotation. Pattern Recognition 174, 112799 (2026). https://doi.org/10.1016/j.patcog.2025.112799

[31] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: CRIS: CLIP-driven referring image segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11676–11685 (2022). https://doi.org/10.1109/CVPR52688.2022.01139

[32] Yang, A., Li, A., Yang, B., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025). https://doi.org/10.48550/arXiv.2505.09388

[33] Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., Berg, T.L.: MAttNet: Modular attention network for referring expression comprehension. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1307–1315 (2018)

[34] Zhang, L., et al.: AI agents vs. Agentic AI: A conceptual taxonomy, applications and challenges. Information Fusion 122, 103599 (2025). https://doi.org/10.1016/j.inffus.2025.103599

[35] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6877–6886 (2021). https://doi.org/10.1109/CVPR46437.2021.00681

[36] Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU loss: Faster and better learning for bounding box regression. In: AAAI. pp. 12993–13000 (2020). https://doi.org/10.1609/aaai.v34i07.6999

[37] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5122–5130 (2017). https://doi.org/10.1109/CVPR.2017.544

[38] Zhu, L., Chen, T., Xu, Q., Liu, X., Ji, D., Wu, H., Soh, D.W., Liu, J.: Popen: Preference-based optimization and ensemble for LVLM-based reasoning segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

A Supplementary Material ...