pith. machine review for the scientific record.

arxiv: 2501.04001 · v3 · submitted 2025-01-07 · 💻 cs.CV

Recognition: 2 theorem links

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal model · referring segmentation · image and video understanding · instruction tuning · grounded vision language · mask generation · complex scene benchmark

The pith

Sa2VA unifies segmentation and language models for referring tasks on both images and videos using minimal instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents Sa2VA as a unified model that combines a video segmentation foundation model with a multimodal language model to enable dense grounded understanding. The system supports referring segmentation and conversation for images and videos after minimal one-shot instruction tuning, by placing all modalities in a shared token space. The language model generates instruction tokens that direct the segmentation component to output precise masks for referred objects. The authors also create the Ref-SAV dataset, with over 72,000 object expressions, to improve performance and to provide a benchmark for complex video scenes. If correct, this shows a path to general-purpose visual understanding systems that handle both static and moving content without separate specialized tools.

Core claim

By unifying the video segmentation model and the vision-language model in a shared LLM token space, instruction tokens generated by the language model can guide the production of precise masks for referring segmentation and conversation tasks across images and videos.

What carries the argument

The shared LLM token space in which the language model produces instruction tokens to direct the segmentation model for mask generation.
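
This interface is the joint the whole claim hinges on, so a minimal sketch helps fix ideas: assume a dedicated [SEG] instruction token whose final hidden state is projected into SAM-2-style prompt embeddings. The paper does not spell out the exact mapping, so the module, token name, and dimensions below are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SegTokenToPrompt(nn.Module):
    """Hypothetical adapter: turn the LLM's [SEG] hidden state into a few
    prompt embeddings for a SAM-2-style mask decoder. All sizes are guesses."""

    def __init__(self, llm_dim: int = 4096, prompt_dim: int = 256, num_prompts: int = 2):
        super().__init__()
        self.num_prompts = num_prompts
        self.prompt_dim = prompt_dim
        self.proj = nn.Linear(llm_dim, prompt_dim * num_prompts)

    def forward(self, seg_hidden: torch.Tensor) -> torch.Tensor:
        # seg_hidden: (batch, llm_dim), the final hidden state at the [SEG] position
        prompts = self.proj(seg_hidden)  # (batch, prompt_dim * num_prompts)
        return prompts.view(-1, self.num_prompts, self.prompt_dim)

adapter = SegTokenToPrompt()
seg_hidden = torch.randn(1, 4096)        # stand-in for the LLM output
prompt_embeddings = adapter(seg_hidden)  # (1, 2, 256), fed to the mask decoder
```

On this reading, such an adapter would be the only new glue between the two large components, which is what makes the "minimal instruction tuning" framing plausible.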

If this is right

  • Supports referring segmentation and conversation on both images and videos.
  • Achieves strong performance in referring video object segmentation in complex scenes.
  • Can be extended to other vision-language models, such as Qwen-VL and Intern-VL, which are updated rapidly in the open-source community.
  • Introduces the Ref-SAV dataset, containing over 72k auto-labeled object expressions for training and 2k manually validated objects for benchmarking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Such unification might simplify deployment in applications requiring real-time understanding of dynamic scenes.
  • Minimal tuning could lower the barrier for adapting models to new visual tasks.
  • Testing on even longer video sequences would reveal if the token guidance scales beyond the current benchmarks.

Load-bearing premise

The generated instruction tokens from the language model are sufficient to produce accurate masks from the segmentation model in complex video scenes without additional task-specific training or architecture changes.

What would settle it

A head-to-head evaluation on the 2k manually validated video objects in Ref-SAV: showing that Sa2VA's mask predictions there are less accurate than those of separate specialized referring-video-segmentation models would undercut the unification claim.
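
A self-contained toy of that comparison, using per-frame IoU as a stand-in for the benchmark's actual metric; `predict` and the clip tuples are hypothetical placeholders, not APIs from the paper or its code release.

```python
import numpy as np

def iou(pred, gt):
    """Per-frame IoU between boolean masks; a stand-in for the J&F metric
    used by referring video object segmentation benchmarks."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def mean_accuracy(predict, clips):
    """`predict(video, expression)` returns per-frame masks; `clips` yields
    (video, expression, gt_masks) tuples. Both are hypothetical stand-ins."""
    per_clip = []
    for video, expression, gt_masks in clips:
        pred_masks = predict(video, expression)
        per_clip.append(np.mean([iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
    return float(np.mean(per_clip))

# The test: run the unified model and a specialist baseline over the same
# 2k validated Ref-SAV objects and compare mean accuracy; if the unified
# model loses, the central claim is weakened.
```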

Original abstract

This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL, which can be updated with rapid process in current open-sourced VLMs. Code and models have been provided to the community.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Sa2VA, a unified model that marries SAM-2 with LLaVA (and is extensible to other VLMs such as Qwen-VL and Intern-VL) to enable dense grounded understanding of both images and videos. It claims to support referring segmentation, conversation, and related tasks via a shared LLM token space in which the LLM generates instruction tokens that directly guide SAM-2's mask decoder, all with only minimal one-shot instruction tuning. The work also contributes the Ref-SAV dataset (>72k auto-labeled object expressions plus a 2k manually validated subset) and reports strong empirical results on referring video object segmentation and related benchmarks.

Significance. If the central integration claim holds, the work would be significant as the first unified architecture for dense grounded image/video understanding that avoids task-specific architectural additions or heavy fine-tuning. The introduction of the Ref-SAV dataset and the public release of code and models are concrete community contributions that could accelerate follow-up research on referring video segmentation.

major comments (3)
  1. [§3] §3 (Model Architecture) and Figure 2: the mapping from LLM-generated instruction tokens to SAM-2 prompt-encoder inputs (points, boxes, or masks) is never specified by equation, diagram, or pseudocode. SAM-2's prompt encoder expects explicit spatial prompts; without an explicit conversion step (linear projection, cross-attention, or adapter), the claim that 'minimal one-shot instruction tuning' suffices cannot be evaluated.
  2. [§4.2] §4.2 and Table 2: no ablation isolates the contribution of the token-to-prompt interface versus dataset engineering or implicit adaptation. The reported gains on Ref-SAV and referring video segmentation therefore cannot be attributed to the claimed unification regime.
  3. [§4.3] §4.3 (Failure-mode analysis): the manuscript provides no quantitative or qualitative analysis of error cases involving occlusion, motion blur, or multiple similar objects, which are precisely the regimes where an underspecified token-to-mask pathway would be expected to degrade.
minor comments (2)
  1. [Abstract] Abstract: 'M LLM' should read 'MLLM'; 'Ref-SAV datasets' is inconsistent with the singular usage elsewhere.
  2. [§5] §5 (Related Work): several recent referring-video-segmentation baselines (e.g., post-2023 works) are missing; the comparison table would be strengthened by their inclusion.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us identify areas for improvement. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and analyses.

Point-by-point responses
  1. Referee: [§3] §3 (Model Architecture) and Figure 2: the mapping from LLM-generated instruction tokens to SAM-2 prompt-encoder inputs (points, boxes, or masks) is never specified by equation, diagram, or pseudocode. SAM-2's prompt encoder expects explicit spatial prompts; without an explicit conversion step (linear projection, cross-attention, or adapter), the claim that 'minimal one-shot instruction tuning' suffices cannot be evaluated.

    Authors: We agree that the token-to-prompt mapping requires more explicit specification. In the revised version, we will add a dedicated equation in §3 describing the lightweight adapter (a linear projection followed by a small cross-attention layer) that converts LLM instruction tokens into the spatial prompt embeddings expected by SAM-2's prompt encoder. We will also update Figure 2 with a detailed diagram and pseudocode for this conversion step. This adapter is the only trainable component during the one-shot instruction tuning, which supports our claim of minimal tuning while preserving SAM-2's frozen weights. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2: no ablation isolates the contribution of the token-to-prompt interface versus dataset engineering or implicit adaptation. The reported gains on Ref-SAV and referring video segmentation therefore cannot be attributed to the claimed unification regime.

    Authors: We acknowledge the need for clearer attribution. In the revision, we will add an ablation study in §4.2 that compares (1) the full Sa2VA model, (2) a variant using direct text-to-SAM-2 prompting without the token interface, and (3) training without the Ref-SAV dataset. These results will quantify the isolated contribution of the unified token-to-prompt pathway and help substantiate the unification regime. revision: yes

  3. Referee: [§4.3] §4.3 (Failure-mode analysis): the manuscript provides no quantitative or qualitative analysis of error cases involving occlusion, motion blur, or multiple similar objects, which are precisely the regimes where an underspecified token-to-mask pathway would be expected to degrade.

    Authors: We will expand §4.3 with a dedicated failure-mode analysis subsection. This will include quantitative metrics (e.g., J&F scores broken down by occlusion level and object similarity) on challenging subsets of Ref-SAV and DAVIS, plus qualitative visualizations of error cases involving occlusion, motion blur, and ambiguous objects. The analysis will discuss how the token interface performs in these regimes and note remaining limitations. revision: yes
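
The promised breakdown hinges on the J&F metric, so a small self-contained approximation is sketched here: J is mask IoU, and F matches boundary pixels within a small tolerance band, which is how the standard DAVIS toolkit approximates contour accuracy. The occlusion-level grouping at the end is a hypothetical illustration of the proposed analysis, not code from the paper.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def region_j(pred, gt):
    """Region similarity J: IoU of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def boundary_f(pred, gt, tol=2):
    """Contour accuracy F, approximated by matching boundary pixels within a
    tol-pixel band (a common approximation of the DAVIS F-measure)."""
    def contour(mask):
        return mask & ~binary_erosion(mask)
    pb, gb = contour(pred), contour(gt)
    precision = (pb & binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def j_and_f(pred_frames, gt_frames):
    """Mean of J and F over a clip's per-frame masks."""
    js = [region_j(p, g) for p, g in zip(pred_frames, gt_frames)]
    fs = [boundary_f(p, g) for p, g in zip(pred_frames, gt_frames)]
    return (np.mean(js) + np.mean(fs)) / 2

# Hypothetical breakdown by annotated occlusion level, as the rebuttal proposes:
# scores = {lvl: np.mean([j_and_f(c.preds, c.gts) for c in clips if c.occlusion == lvl])
#           for lvl in ("none", "partial", "heavy")}
```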

Circularity Check

0 steps flagged

No circularity: empirical integration of SAM-2 and MLLM with external benchmarks

Full rationale

The paper presents Sa2VA as an architectural marriage of SAM-2 and an MLLM (LLaVA-style) that unifies modalities into a shared token space and generates instruction tokens to guide mask production. All load-bearing claims are supported by reported experiments on Ref-SAV and other benchmarks rather than by any closed derivation. No equations appear that define outputs in terms of fitted inputs, no self-citation chain justifies a uniqueness theorem or ansatz, and the mapping from LLM states to SAM-2 prompts is treated as an engineering choice validated externally. The claims are therefore grounded in independent evaluation data rather than in self-referential reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard deep-learning training assumptions plus the new integration mechanism and dataset; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: Standard assumptions in deep learning training and optimization hold for the combined model
    Invoked implicitly when claiming that minimal one-shot tuning suffices.
invented entities (1)
  • Ref-SAV dataset (no independent evidence)
    purpose: Auto-labeled training and benchmark data for referring video object segmentation
    New dataset of 72k expressions introduced to support training.

pith-pipeline@v0.9.0 · 5591 in / 1168 out tokens · 24012 ms · 2026-05-16T11:35:57.949381+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    cs.CV 2026-01 unverdicted novelty 8.0

    Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

  2. Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

    cs.CV 2026-05 conditional novelty 7.0

    SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annot...

  3. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

  4. ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

    cs.CV 2026-05 unverdicted novelty 7.0

    ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.

  5. MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

    cs.MM 2026-04 unverdicted novelty 7.0

    MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...

  6. Training a Student Expert via Semi-Supervised Foundation Model Distillation

    cs.CV 2026-04 conditional novelty 7.0

    A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

  7. 3AM: 3egment Anything with Geometric Consistency in Videos

    cs.CV 2026-01 unverdicted novelty 7.0

    3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

  8. Venus-DeFakerOne: Unified Fake Image Detection & Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.

  9. MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    MPerS dynamically mixes semantic guidance from MLLM-generated RS captions with DINOv3 features via MixExperts and Linguistic Query Guided Attention to achieve superior semantic segmentation on three public remote sens...

  10. X2SAM: Any Segmentation in Images and Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.

  11. SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatiO uses heterogeneous vision-language agents with test-time orchestration to dynamically weight their contributions for improved spatial reasoning on benchmarks like 3DSRBench and CV-Bench.

  12. Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.

  13. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  14. Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    The 2026 PVUW Challenge introduces a new audio track and evaluates top multimodal methods on challenging video datasets for pixel-level understanding.

  15. 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA

    cs.CV 2026-04 unverdicted novelty 3.0

    ASR-SaSaSa2VA turns audio into text via ASR then feeds it to pre-trained referring video segmentation models, achieving 80.7 and second place in the 5th PVUW MeViS-v2-Audio track.

  16. APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

    cs.SD 2026-04 unverdicted novelty 3.0

    A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...

  17. AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method

    cs.CV 2026-04 unverdicted novelty 3.0

    An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.

  18. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

125 extracted references · 125 canonical work pages · cited by 18 Pith papers · 19 internal anchors
