InstructSAM: Segment Any Instance with Any Instructions

Juncheng Li; Jun Xiao; Siliang Tang; Wenqiao Zhang; Wentong Li; Yueting Zhuang; Yuqian Yuan; Yutong Lin; Zhaocheng Li

arxiv: 2605.26102 · v2 · pith:AVDLC3GHnew · submitted 2026-05-25 · 💻 cs.CV

Pith reviewed 2026-06-29 22:31 UTC · model grok-4.3

keywords instruction-driven segmentation · multi-instance segmentation · vision-language model · SAM3 · learnable queries · hybrid attention · referring segmentation · Inst2Seg dataset

The pith

InstructSAM bridges a vision-language model to SAM3 via learnable instance queries so that arbitrary instructions drive single-pass multi-instance segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InstructSAM as a framework that turns instruction-driven instance segmentation into a set-structured query prediction task. It injects a bank of learnable instance queries into a vision-language model where they interact with instruction and visual tokens through hybrid attention, then projects the resulting conditioned queries into SAM3's detector query space. This interface equips SAM3 with high-level instruction understanding and compositional reasoning without altering its core architecture. The authors also release the Inst2Seg dataset that pairs free-form instructions with instance masks to train and benchmark the approach. A 2B-scale version of the model is shown to outperform prior end-to-end methods and SAM3's agentic pipeline on complex instruction-driven and phrase-level referring segmentation tasks while remaining efficient.

Core claim

InstructSAM formulates instruction-driven instance segmentation as a set-structured query prediction problem and proposes an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model and SAM3. A bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information so each query serves as an instance-aware slot. A hybrid-attention mechanism promotes interaction among the queries, visual tokens, and instruction tokens. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass, giving SAM3 high-level instruction un

What carries the argument

The reasoning-to-instance query interface that injects learnable instance queries into the VLM for contextualization with instructions and visuals then projects the conditioned queries into SAM3's detector query space.
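
One way to picture this interface is as an attention mask over a single sequence made of visual tokens, instruction tokens, and instance queries. The sketch below is illustrative only. The abstract does not specify the hybrid-attention pattern, so the rule used here (causal attention for context tokens, full attention for the queries) is an assumption, and the function name and token layout are hypothetical.

```python
def hybrid_attention_mask(num_visual, num_text, num_queries):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: [visual tokens | instruction tokens | instance queries].
    Assumed rule (not taken from the paper): context tokens keep the usual
    causal mask, while instance queries see all context tokens and each other.
    """
    num_context = num_visual + num_text
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context token: causal attention, never looks at queries.
                mask[i][j] = j <= i
            else:
                # Instance query: full view of context and of all queries.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    m = hybrid_attention_mask(num_visual=2, num_text=2, num_queries=3)
    for row in m:
        print("".join("x" if allowed else "." for allowed in row))
```

Letting the queries attend to one another is what would, in principle, allow them to divide instances among themselves and avoid duplicate predictions. Whether the paper uses this exact pattern is not stated in the abstract.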

If this is right

Enables efficient single-pass multi-instance prediction under arbitrary instructions.
Achieves strong results on complex instruction-driven and phrase-level referring segmentation benchmarks with only a 2B-scale model.
Outperforms prior end-to-end methods and SAM3's agentic pipeline.
Supports compositional reasoning and instance-level set prediction without core architecture changes to SAM3.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same query-injection pattern could be tested on other detector backbones to see whether instruction following generalizes beyond SAM3.
Inst2Seg may be reused to train non-SAM3 models on free-form instruction segmentation tasks.
Real-time interactive tools could result if the single-pass efficiency holds under streaming video instructions.

Load-bearing premise

The projection of LLM-conditioned queries into SAM3's detector query space produces accurate multi-instance segmentation without any modification to SAM3's core architecture or additional post-processing.

What would settle it

Run the projected queries on a benchmark containing complex instructions that describe multiple overlapping or similar instances and check whether the output masks contain duplicates or omissions that would have been resolved only by post-processing or architecture changes.
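
The check described above can be sketched as a small script. This is a minimal illustration, not the paper's evaluation protocol: masks are represented as sets of pixel coordinates, the 0.5 IoU threshold is an assumed value, and the helper names (`find_duplicates`, `find_omissions`, `box`) are hypothetical.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_duplicates(pred_masks, iou_thresh=0.5):
    """Index pairs of predictions that overlap enough to count as duplicates."""
    pairs = []
    for i in range(len(pred_masks)):
        for j in range(i + 1, len(pred_masks)):
            if mask_iou(pred_masks[i], pred_masks[j]) >= iou_thresh:
                pairs.append((i, j))
    return pairs


def find_omissions(pred_masks, gt_masks, iou_thresh=0.5):
    """Indices of ground-truth instances that no prediction covers."""
    missed = []
    for g, gt in enumerate(gt_masks):
        if not any(mask_iou(p, gt) >= iou_thresh for p in pred_masks):
            missed.append(g)
    return missed


def box(r0, c0, r1, c1):
    """Helper: a filled rectangle as a set of pixels (end-exclusive)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


if __name__ == "__main__":
    gt = [box(0, 0, 4, 4), box(0, 6, 4, 10), box(6, 0, 10, 4)]
    # Toy predictions: two near-identical masks on the first instance,
    # one on the second, none on the third.
    pred = [box(0, 0, 4, 4), box(0, 0, 4, 3), box(0, 6, 4, 10)]
    print("duplicates:", find_duplicates(pred))
    print("omissions:", find_omissions(pred, gt))
```

The omission check here accepts any prediction that overlaps a ground-truth mask, which is simpler than the one-to-one matching a full benchmark would normally use.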

Original abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that only 2B-scale InstructSAM achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

InstructSAM adds a learnable-query interface through a VLM to drive SAM3 with instructions in one pass, plus a new dataset, but the projection step remains underspecified.

Colleague,

The main takeaway is that InstructSAM turns instruction-driven instance segmentation into a set query problem: a bank of learnable queries enters the VLM, gets contextualized via hybrid attention with visual and instruction tokens, and is then projected into SAM3's detector query space for single-pass multi-instance masks. They also release Inst2Seg, a dataset pairing free-form instructions with instance masks.

The design choice that stands out is the explicit reasoning-to-instance query interface combined with the hybrid attention, which is meant to improve instance enumeration and cut down on duplicates. This is a direct attempt to give SAM3 compositional instruction understanding without an agentic pipeline or changes to its core. The dataset looks like a useful addition for training and benchmarking in this space. If the experiments check out, the efficiency claim for the 2B-scale model over prior end-to-end methods and SAM3's agentic setup is the practical result worth noting.

The soft spot is the projection itself. The abstract states that the LLM-conditioned queries are mapped into SAM3's space without core modifications or extra post-processing, yet it supplies no operator details, equations, or ablations that isolate whether that transfer preserves accuracy or introduces misalignment. The result that only the 2B model succeeds could easily trace back to how well that interface is tuned at different scales rather than the overall architecture. The stress-test concern about assuming clean transfer without further work holds on the available information; the lack of visible error bars, dataset statistics, or baseline controls makes it difficult to judge robustness.

This is for CV researchers extending SAM-style models toward language-driven or compositional tasks. Someone working on VLM-detector interfaces or referring segmentation would get value from the query design and the benchmark. It deserves peer review so the projection mechanics and experimental controls can be examined in full.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces InstructSAM, a framework for multi-instance segmentation under arbitrary instructions. It formulates the task as set-structured query prediction by injecting a bank of learnable instance queries into a VLM for contextualization via hybrid attention with instruction and visual tokens, then projects the resulting LLM-conditioned queries into SAM3's detector query space to drive single-pass multi-instance segmentation. The design claims to equip SAM3 with instruction understanding and compositional reasoning without core architecture changes. A new Inst2Seg dataset is constructed to couple free-form instructions with instance masks. Experiments report that a model of only 2B scale achieves strong results on instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline.

Significance. If the query projection interface functions as described, the work would provide a practical, efficient route to instruction-conditioned segmentation by reusing SAM3's existing detector without post-processing or architectural edits. The Inst2Seg benchmark construction is a clear positive contribution for the community studying compositional referring tasks. The single-pass multi-instance capability and the reported scale threshold (only 2B succeeds) are potentially useful empirical observations if supported by detailed controls.

major comments (3)

[Method description (abstract and §3)] The projection of LLM-conditioned queries into SAM3's detector query space is stated at a high level only, with no explicit operator, equations, or implementation details provided. This step is load-bearing for the central claim that the approach works 'without modifying its core architecture' and enables accurate multi-instance masks in one forward pass; its sufficiency is assumed rather than demonstrated.
[Experiments (abstract)] No ablation results, error bars, or dataset statistics are reported to isolate the hybrid-attention mechanism or the learnable query bank. Without these controls, the contributions to 'improving instance enumeration and reducing duplicate predictions' cannot be verified as necessary for the reported outperformance.
[Results (abstract)] The claim that 'only 2B-scale InstructSAM achieves strong results' is presented without comparisons to smaller-scale variants, statistical significance tests, or full benchmark tables. This scale-specific finding is central to the empirical narrative yet rests on unelaborated evidence.

minor comments (2)

The abstract refers to 'SAM3's agentic pipeline' as a baseline without defining its components or implementation in the provided text; a brief description would improve clarity.
Notation for the 'bank of learnable instance queries' and 'hybrid-attention mechanism' is introduced without accompanying equations or pseudocode, which could be added for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the method description, experimental controls, and empirical claims. We address each point below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: [Method description (abstract and §3)] The projection of LLM-conditioned queries into SAM3's detector query space is stated at a high level only, with no explicit operator, equations, or implementation details provided. This step is load-bearing for the central claim that the approach works 'without modifying its core architecture' and enables accurate multi-instance masks in one forward pass; its sufficiency is assumed rather than demonstrated.

Authors: We agree that the projection step requires more explicit detail to support the no-core-change claim. In the revision we will add the precise operator (a learned linear projection with layer norm) and the corresponding equations in §3, showing the dimensionality mapping and confirming that SAM3's detector remains unmodified. revision: yes
Referee: [Experiments (abstract)] No ablation results, error bars, or dataset statistics are reported to isolate the hybrid-attention mechanism or the learnable query bank. Without these controls, the contributions to 'improving instance enumeration and reducing duplicate predictions' cannot be verified as necessary for the reported outperformance.

Authors: The full manuscript contains ablations on these components, but we accept that they are not sufficiently highlighted or statistically detailed. We will expand §4 with additional ablations, error bars across runs, and dataset statistics to better isolate the hybrid-attention and query-bank contributions. revision: yes
Referee: [Results (abstract)] The claim that 'only 2B-scale InstructSAM achieves strong results' is presented without comparisons to smaller-scale variants, statistical significance tests, or full benchmark tables. This scale-specific finding is central to the empirical narrative yet rests on unelaborated evidence.

Authors: We will revise the results section to include explicit comparisons against 0.5B and 1B variants, report statistical significance where appropriate, and append full benchmark tables. This will substantiate the scale threshold observation. revision: yes
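
The operator named in the first response, a learned linear projection with layer norm, can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the order (normalize, then project), the omission of learned gain and bias in the norm, and the sizes `d_llm` and `d_det` are all assumptions, and the weights are random stand-ins for parameters that would be learned in training.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """y = W x + b, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_queries(queries, weight, bias):
    """Map each LLM-conditioned query into the (assumed) detector query width."""
    return [linear(layer_norm(q), weight, bias) for q in queries]


if __name__ == "__main__":
    rng = random.Random(0)
    d_llm, d_det, num_queries = 8, 4, 3  # hypothetical sizes, not from the paper
    weight = [[rng.gauss(0, 0.02) for _ in range(d_llm)] for _ in range(d_det)]
    bias = [0.0] * d_det
    queries = [[rng.gauss(0, 1) for _ in range(d_llm)] for _ in range(num_queries)]
    projected = project_queries(queries, weight, bias)
    print(len(projected), len(projected[0]))  # one d_det-wide vector per query
```

Because the operator only changes the width of each query vector, a detector that consumes the projected queries would not itself need to change, which is the property the no-core-change claim depends on.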

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with no derivation chain

The paper introduces InstructSAM as an engineering framework that injects learnable queries into a VLM, applies hybrid attention, and projects the resulting queries into SAM3's detector space. All central claims are presented as empirical outcomes on the new Inst2Seg benchmark and existing referring segmentation tasks, with performance attributed to the 2B-scale model. No equations, first-principles derivations, fitted-parameter predictions, or uniqueness theorems appear in the provided text. The projection step is described as a design choice rather than a result derived from prior inputs, and no self-citation load-bearing arguments are invoked. The derivation chain is therefore self-contained as an empirical demonstration rather than a reduction to its own assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach depends on the existence of a bank of learnable instance queries whose training dynamics are not detailed in the abstract; no other free parameters, axioms, or invented entities are explicitly introduced.

free parameters (1)

bank of learnable instance queries
Injected into the VLM and contextualized with instruction and visual information; their number and initialization are not specified in the abstract.

pith-pipeline@v0.9.1-grok · 5790 in / 1153 out tokens · 30735 ms · 2026-06-29T22:31:24.723620+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 26 canonical work pages · 12 internal anchors

[1]

Rynnec: Bringing mllms into embodied world.arXiv preprint arXiv:2508.14160, 2025

Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, and Deli Zhao. Rynnec: Bringing mllms into embodied world.arXiv preprint arXiv:2508.14160, 2025

work page arXiv 2025
[2]

Eoc-bench: Can mllms identify, recall, and forecast objects in an egocentric world?arXiv preprint arXiv:2506.05287, 2025

Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, et al. Eoc-bench: Can mllms identify, recall, and forecast objects in an egocentric world?arXiv preprint arXiv:2506.05287, 2025

work page arXiv 2025
[3]

Agentvln: Towards agentic vision-and-language navigation.arXiv preprint arXiv:2603.17670, 2026

Zihao Xin, Wentong Li, Yixuan Jiang, Ziyuan Huang, Bin Wang, Piji Li, Jianke Zhu, Jie Qin, and Shengjun Huang. Agentvln: Towards agentic vision-and-language navigation.arXiv preprint arXiv:2603.17670, 2026

work page arXiv 2026
[4]

What should I wear to protect my hands from hot water and chemicals while scrubbing? yellow rubber gloves

Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng, Kai Chen, Ziwei 13 Which objects are used to add some natural greenery and decoration to the indoor window area? potted plants. What should I wear to protect my hands from hot water and chemicals while scrubbing? yellow rubber gloves. Which items are being used specific...

2024
[5]

Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation.arXiv preprint arXiv:2502.09838, 2025

Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, et al. Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation.arXiv preprint arXiv:2502.09838, 2025

work page arXiv 2025
[6]

HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding

Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenjie Yan, Wenqiao Zhang, Xiaogang Guo, Jun Xiao, et al. Heartcare suite: A unified multimodal ecg suite for dual signal-image modeling 14 What item in the picture can be classified as such a weapon? knife. What part of the picture can help identify the ownership and registration o...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Omnict: Towards a unified slice-volume lvlm for comprehensive ct analysis.arXiv preprint arXiv:2602.16110, 2026

Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, et al. Omnict: Towards a unified slice-volume lvlm for comprehensive ct analysis.arXiv preprint arXiv:2602.16110, 2026

work page arXiv 2026
[8]

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, and Beng Chin Ooi. Lmms meet object-centric vision: Understanding, segmentation, editing and generation.arXiv preprint arXiv:2604.11789, 2026. 15 Woman with arm up in the air. woman with arm up in the air. Far right person in the background. fa...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Unified personalized understanding, generating and editing.arXiv preprint arXiv:2601.06965, 2026

Yu Zhong, Tianwei Lin, Ruike Zhu, Yuqian Yuan, Haoyu Zheng, Liang Liang, Wenqiao Zhang, Feifei Shao, Haoyuan Li, Wanggui He, et al. Unified personalized understanding, generating and editing.arXiv preprint arXiv:2601.06965, 2026

work page arXiv 2026
[10]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

2023
[11]

Sam 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, 16 Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In International Conference on Learning Representations, volume 2025, pages 28085–28128, 2025

2025
[12]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024

2024
[16]

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, et al. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Lisa++: An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240, 2023

Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. Lisa++: An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240, 2023

work page arXiv 2023
[18]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications.arXiv preprint arXiv:2306.14289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

Lei Ke, Mingqiao Ye, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu, et al. Segment anything in high quality.Advances in Neural Information Processing Systems, 36:29914–29934, 2023

2023
[20]

Fast segment anything.arXiv preprint arXiv:2306.12156, 2023

Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything.arXiv preprint arXiv:2306.12156, 2023

work page arXiv 2023
[21]

SAM3-I: Segment Anything with Instructions

Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, et al. Sam3-i: Segment anything with instructions.arXiv preprint arXiv:2512.04585, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023
[25]

Osprey: Pixel understanding with visual instruction tuning

Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28202–28211, 2024

2024
[26]

Videorefer suite: Advancing spatial-temporal object understanding with video llm

Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. Videorefer suite: Advancing spatial-temporal object understanding with video llm. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18970–18980, 2025

2025
[27]

Pixelrefer: A unified framework for spatio-temporal object referring with arbitrary granularity.arXiv preprint arXiv:2510.23603, 2025

Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, and Beng Chin Ooi. Pixelrefer: A unified framework for spatio-temporal object referring with arbitrary granularity.arXiv preprint arXiv:2510.23603, 2025. 17

work page arXiv 2025
[28]

Tokenpacker: Efficient visual projector for multimodal llm.International Journal of Computer Vision, 133(10):6794–6812, 2025

Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm.International Journal of Computer Vision, 133(10):6794–6812, 2025

2025
[29]

Hyperllava: Dynamic visual and language expert tuning for multimodal large language models

Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, et al. Hyperllava: Dynamic visual and language expert tuning for multimodal large language models. arXiv preprint arXiv:2403.13447, 2024

work page arXiv 2024
[30]

Mau-gpt: Enhancing multi-type industrial anomaly understanding via anomaly-aware and generalist experts adaptation

Zhuonan Wang, Zhenxuan Fan, Siwen Tan, Yu Zhong, Yuqian Yuan, Haoyuan Li, Hao Jiang, Wenqiao Zhang, Feifei Shao, Hongwei Wang, et al. Mau-gpt: Enhancing multi-type industrial anomaly understanding via anomaly-aware and generalist experts adaptation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 26787–26795, 2026

2026
[31]

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, and Yueting Zhuang. Crossview suite: Harnessing cross-view spatial intelligence of mllms with dataset, model and benchmark.arXiv preprint arXiv:2605.18621, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

Haoyu Zheng, Tianwei Lin, Wei Wang, Zhuonan Wang, Wenqiao Zhang, Jiaqi Zhu, and Feifei Shao. Iad-unify: A region-grounded unified model for industrial anomaly segmentation, understanding, and generation.arXiv preprint arXiv:2604.12440, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024

2024
[34]

Pixellm: Pixel reasoning with large multimodal model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024

2024
[35]

One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, 37:6833–6859, 2024

Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, 37:6833–6859, 2024

2024
[36]

Visa: Reasoning video object segmentation via large language models

Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. InEuropean Conference on Computer Vision, pages 98–115. Springer, 2024

2024
[37]

X-sam: From segment anything to any segmentation.arXiv preprint arXiv:2508.04655, 2025

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation.arXiv preprint arXiv:2508.04655, 2025

work page arXiv 2025
[38]

Schwing, Alexander Kirillov, and Rohit Girdhar

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InCVPR, 2022

2022
[39]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

2020
[40]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

2014
[41]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

2022
[42]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. InEuropean Conference on Computer Vision (ECCV), 2018

2018
[43]

Hd-epic: A highly-detailed egocentric video dataset

Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025

2025
[44]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 18

2014
[45]

Generation and comprehension of unambiguous object descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

2016
[46]

Grec: Generalized referring expression comprehension

Shuting He, Henghui Ding, Chang Liu, and Xudong Jiang. Grec: Generalized referring expression comprehension. arXiv preprint arXiv:2308.16182, 2023

2023
[47]

Groundingsuite: Measuring complex multi-granular pixel grounding

Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Groundingsuite: Measuring complex multi-granular pixel grounding. arXiv preprint arXiv:2503.10596, 2025

2025
[48]

Gsva: Generalized segmentation via multimodal large language models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024

2024
[49]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. In European Conference on Computer Vision, pages 74–91. Springer, 2024

2024
[50]

Evf-sam: Early vision-language fusion for text-prompted segment anything model

Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model. arXiv preprint arXiv:2406.20076, 2024

2024
[51]

Instructseg: Unifying instructed visual segmentation with multi-modal large language models

Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Hongfa Wang, and Yujiu Yang. Instructseg: Unifying instructed visual segmentation with multi-modal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20193–20203, 2025

2025
[52]

Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes

Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, and Shengjin Wang. Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 976–983. IEEE, 2023.

2023
