
arxiv: 2605.26102 · v2 · pith:AVDLC3GHnew · submitted 2026-05-25 · 💻 cs.CV

InstructSAM: Segment Any Instance with Any Instructions

Pith reviewed 2026-06-29 22:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords instruction-driven segmentation · multi-instance segmentation · vision-language model · SAM3 · learnable queries · hybrid attention · referring segmentation · Inst2Seg dataset

The pith

InstructSAM bridges a vision-language model to SAM3 via learnable instance queries so that arbitrary instructions drive single-pass multi-instance segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InstructSAM as a framework that turns instruction-driven instance segmentation into a set-structured query prediction task. It injects a bank of learnable instance queries into a vision-language model where they interact with instruction and visual tokens through hybrid attention, then projects the resulting conditioned queries into SAM3's detector query space. This interface equips SAM3 with high-level instruction understanding and compositional reasoning without altering its core architecture. The authors also release the Inst2Seg dataset that pairs free-form instructions with instance masks to train and benchmark the approach. A 2B-scale version of the model is shown to outperform prior end-to-end methods and SAM3's agentic pipeline on complex instruction-driven and phrase-level referring segmentation tasks while remaining efficient.
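The "set-structured query prediction" framing can be made concrete with a small sketch. The text here does not say how query slots are assigned to ground-truth instances during training; DETR-style detectors commonly use one-to-one bipartite matching, so the example below assumes that. The function name `match_slots` and the cost values are hypothetical, and the brute-force search is only practical for tiny examples.

```python
from itertools import permutations

def match_slots(cost):
    """One-to-one assignment of query slots to ground-truth instances.

    cost[q][g] is the (hypothetical) cost of letting query slot q explain
    ground-truth instance g. Assumes at least as many slots as instances.
    Returns (assignment, total) where assignment[g] is the chosen slot index.
    """
    num_queries = len(cost)
    num_gt = len(cost[0]) if cost else 0
    best, best_total = (), float("inf")
    # Brute force over all injective assignments; fine for a toy example only.
    for perm in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total

# Three query slots, two ground-truth instances (made-up costs).
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
assignment, total = match_slots(cost)
```

Slots left unassigned would be trained to predict "no instance", which is one common way a fixed-size query bank handles a variable number of objects.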

Core claim

InstructSAM formulates instruction-driven instance segmentation as a set-structured query prediction problem and proposes an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model and SAM3. A bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information so each query serves as an instance-aware slot. A hybrid-attention mechanism promotes interaction among the queries, visual tokens, and instruction tokens. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass, giving SAM3 high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture.

What carries the argument

The reasoning-to-instance query interface that injects learnable instance queries into the VLM for contextualization with instructions and visuals then projects the conditioned queries into SAM3's detector query space.
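The abstract does not define the hybrid-attention pattern, so the following is only one plausible reading, sketched to show what "queries interacting with visual and instruction tokens" could mean at the mask level. The token layout, the causal rule for prefix tokens, and the function name `hybrid_attention_mask` are all assumptions, not details from the paper.

```python
def hybrid_attention_mask(num_visual, num_text, num_queries):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: [visual tokens | instruction tokens | instance queries].
    Assumed rule: prefix tokens attend causally, while each query slot sees
    the whole prefix and every other query slot (bidirectional query block).
    """
    prefix = num_visual + num_text
    total = prefix + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < prefix:
                # Visual and instruction tokens: standard causal attention.
                mask[i][j] = j <= i
            else:
                # Query slots: full visibility over prefix and other queries.
                mask[i][j] = True
    return mask

# Two visual tokens, one instruction token, two instance queries.
mask = hybrid_attention_mask(2, 1, 2)
```

Letting query slots see each other is what would, in principle, allow them to coordinate and avoid claiming the same instance twice.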

If this is right

  • Enables efficient single-pass multi-instance prediction under arbitrary instructions.
  • Achieves strong results on complex instruction-driven and phrase-level referring segmentation benchmarks with only a 2B-scale model.
  • Outperforms prior end-to-end methods and SAM3's agentic pipeline.
  • Supports compositional reasoning and instance-level set prediction without core architecture changes to SAM3.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same query-injection pattern could be tested on other detector backbones to see whether instruction following generalizes beyond SAM3.
  • Inst2Seg may be reused to train non-SAM3 models on free-form instruction segmentation tasks.
  • Real-time interactive tools could result if the single-pass efficiency holds under streaming video instructions.

Load-bearing premise

The projection of LLM-conditioned queries into SAM3's detector query space produces accurate multi-instance segmentation without any modification to SAM3's core architecture or additional post-processing.

What would settle it

Run the projected queries on a benchmark containing complex instructions that describe multiple overlapping or similar instances and check whether the output masks contain duplicates or omissions that would have been resolved only by post-processing or architecture changes.
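The check described above could be scripted roughly as follows. This is a minimal sketch assuming masks are equally sized 2D lists of 0/1 values; the thresholds (0.8 for duplicates, 0.5 for a match) and the helper names are illustrative choices, not values from the paper.

```python
def mask_iou(a, b):
    """Intersection-over-union of two binary masks given as 2D lists."""
    inter = sum(bool(x) and bool(y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(bool(x) or bool(y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0

def find_duplicates(pred_masks, iou_threshold=0.8):
    """Index pairs of predicted masks that overlap almost completely."""
    pairs = []
    for i in range(len(pred_masks)):
        for j in range(i + 1, len(pred_masks)):
            if mask_iou(pred_masks[i], pred_masks[j]) >= iou_threshold:
                pairs.append((i, j))
    return pairs

def find_omissions(pred_masks, gt_masks, iou_threshold=0.5):
    """Indices of ground-truth masks that no prediction covers well enough."""
    return [
        g for g, gt in enumerate(gt_masks)
        if all(mask_iou(p, gt) < iou_threshold for p in pred_masks)
    ]

# Toy 2x2 masks: two identical predictions and one missed instance.
top = [[1, 1], [0, 0]]
bottom = [[0, 0], [1, 1]]
duplicates = find_duplicates([top, top])
omissions = find_omissions([top, top], [top, bottom])
```

Non-empty `duplicates` or `omissions` on the raw single-pass output would be evidence that some post-processing is still needed.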

Original abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that only 2B-scale InstructSAM achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces InstructSAM, a framework for multi-instance segmentation under arbitrary instructions. It formulates the task as set-structured query prediction by injecting a bank of learnable instance queries into a VLM for contextualization via hybrid attention with instruction and visual tokens, then projects the resulting LLM-conditioned queries into SAM3's detector query space to drive single-pass multi-instance segmentation. The design claims to equip SAM3 with instruction understanding and compositional reasoning without core architecture changes. A new Inst2Seg dataset is constructed to couple free-form instructions with instance masks. Experiments report that a model of only 2B scale achieves strong results on instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline.

Significance. If the query projection interface functions as described, the work would provide a practical, efficient route to instruction-conditioned segmentation by reusing SAM3's existing detector without post-processing or architectural edits. The Inst2Seg benchmark construction is a clear positive contribution for the community studying compositional referring tasks. The single-pass multi-instance capability and the reported scale threshold (only 2B succeeds) are potentially useful empirical observations if supported by detailed controls.

major comments (3)
  1. [Method description (abstract and §3)] The projection of LLM-conditioned queries into SAM3's detector query space is stated at a high level only, with no explicit operator, equations, or implementation details provided. This step is load-bearing for the central claim that the approach works 'without modifying its core architecture' and enables accurate multi-instance masks in one forward pass; its sufficiency is assumed rather than demonstrated.
  2. [Experiments (abstract)] No ablation results, error bars, or dataset statistics are reported to isolate the hybrid-attention mechanism or the learnable query bank. Without these controls, the contributions to 'improving instance enumeration and reducing duplicate predictions' cannot be verified as necessary for the reported outperformance.
  3. [Results (abstract)] The claim that 'only 2B-scale InstructSAM achieves strong results' is presented without comparisons to smaller-scale variants, statistical significance tests, or full benchmark tables. This scale-specific finding is central to the empirical narrative yet rests on unelaborated evidence.
minor comments (2)
  1. The abstract refers to 'SAM3's agentic pipeline' as a baseline without defining its components or implementation in the provided text; a brief description would improve clarity.
  2. Notation for the 'bank of learnable instance queries' and 'hybrid-attention mechanism' is introduced without accompanying equations or pseudocode, which could be added for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the method description, experimental controls, and empirical claims. We address each point below and will revise the manuscript accordingly to improve clarity and completeness.

  1. Referee: [Method description (abstract and §3)] The projection of LLM-conditioned queries into SAM3's detector query space is stated at a high level only, with no explicit operator, equations, or implementation details provided. This step is load-bearing for the central claim that the approach works 'without modifying its core architecture' and enables accurate multi-instance masks in one forward pass; its sufficiency is assumed rather than demonstrated.

    Authors: We agree that the projection step requires more explicit detail to support the no-core-change claim. In the revision we will add the precise operator (a learned linear projection with layer norm) and the corresponding equations in §3, showing the dimensionality mapping and confirming that SAM3's detector remains unmodified. revision: yes

  2. Referee: [Experiments (abstract)] No ablation results, error bars, or dataset statistics are reported to isolate the hybrid-attention mechanism or the learnable query bank. Without these controls, the contributions to 'improving instance enumeration and reducing duplicate predictions' cannot be verified as necessary for the reported outperformance.

    Authors: The full manuscript contains ablations on these components, but we accept that they are not sufficiently highlighted or statistically detailed. We will expand §4 with additional ablations, error bars across runs, and dataset statistics to better isolate the hybrid-attention and query-bank contributions. revision: yes

  3. Referee: [Results (abstract)] The claim that 'only 2B-scale InstructSAM achieves strong results' is presented without comparisons to smaller-scale variants, statistical significance tests, or full benchmark tables. This scale-specific finding is central to the empirical narrative yet rests on unelaborated evidence.

    Authors: We will revise the results section to include explicit comparisons against 0.5B and 1B variants, report statistical significance where appropriate, and append full benchmark tables. This will substantiate the scale threshold observation. revision: yes
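The operator named in the simulated response to point 1, a learned linear projection with layer norm, can be sketched in a few lines. The rebuttal does not say whether normalization comes before or after the linear map, whether the norm has learnable scale and shift, or what the dimensions are; the version below assumes a parameter-free layer norm applied first and uses made-up toy sizes. `project_queries`, `weight`, and `bias` are hypothetical names.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def project_queries(queries, weight, bias):
    """Map N query vectors from the VLM width to the detector query width.

    queries: N x d_llm, weight: d_sam x d_llm, bias: d_sam.
    Assumed order: layer norm first, then the learned linear map.
    """
    projected = []
    for q in queries:
        normed = layer_norm(q)
        projected.append(
            [sum(w * x for w, x in zip(row, normed)) + b
             for row, b in zip(weight, bias)]
        )
    return projected

# Three conditioned queries of width 4 mapped to width 2 (toy sizes).
queries = [[1.0, 2.0, 3.0, 4.0]] * 3
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
sam_queries = project_queries(queries, weight, bias)
```

Because only `weight` and `bias` would be trained in this reading, the detector that consumes `sam_queries` can stay frozen, which is the property the no-core-change claim depends on.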

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with no derivation chain


The paper introduces InstructSAM as an engineering framework that injects learnable queries into a VLM, applies hybrid attention, and projects the resulting queries into SAM3's detector space. All central claims are presented as empirical outcomes on the new Inst2Seg benchmark and existing referring segmentation tasks, with performance attributed to the 2B-scale model. No equations, first-principles derivations, fitted-parameter predictions, or uniqueness theorems appear in the provided text. The projection step is described as a design choice rather than a result derived from prior inputs, and no self-citation load-bearing arguments are invoked. The derivation chain is therefore self-contained as an empirical demonstration rather than a reduction to its own assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach depends on the existence of a bank of learnable instance queries whose training dynamics are not detailed in the abstract; no other free parameters, axioms, or invented entities are explicitly introduced.

free parameters (1)
  • bank of learnable instance queries
    Injected into the VLM and contextualized with instruction and visual information; their number and initialization are not specified in the abstract.


