pith. machine review for the scientific record.

arxiv: 2501.04001 · v3 · submitted 2025-01-07 · 💻 cs.CV

Recognition: 2 theorem links

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal model · referring segmentation · image and video understanding · instruction tuning · grounded vision language · mask generation · complex scene benchmark

The pith

Sa2VA unifies segmentation and language models for referring tasks on both images and videos using minimal instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents Sa2VA as a unified model that combines a video segmentation foundation model with a multimodal language model to enable dense grounded understanding. The system supports referring segmentation and conversation for images and videos after minimal one-shot instruction tuning, by placing all modalities in a shared token space. The language model generates instruction tokens that direct the segmentation component to output precise masks for referred objects. The authors also create the Ref-SAV dataset, with over 72,000 object expressions, to improve performance and to provide a benchmark for complex video scenes. If correct, this shows a path to general-purpose visual understanding systems that handle both static and moving content without separate specialized tools.

Core claim

By unifying the video segmentation model and the vision-language model in a shared LLM token space, instruction tokens generated by the language model can guide the production of precise masks for referring segmentation and conversation tasks across images and videos.

What carries the argument

The shared LLM token space in which the language model produces instruction tokens to direct the segmentation model for mask generation.
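
This interface is the joint the whole claim hinges on, so a minimal sketch helps fix ideas: assume a dedicated [SEG] instruction token whose final hidden state is projected into SAM-2-style prompt embeddings. The paper does not spell out the exact mapping, so the module, token name, and dimensions below are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SegTokenToPrompt(nn.Module):
    """Hypothetical adapter: turn the LLM's [SEG] hidden state into a few
    prompt embeddings for a SAM-2-style mask decoder. All sizes are guesses."""

    def __init__(self, llm_dim: int = 4096, prompt_dim: int = 256, num_prompts: int = 2):
        super().__init__()
        self.num_prompts = num_prompts
        self.prompt_dim = prompt_dim
        self.proj = nn.Linear(llm_dim, prompt_dim * num_prompts)

    def forward(self, seg_hidden: torch.Tensor) -> torch.Tensor:
        # seg_hidden: (batch, llm_dim), the final hidden state at the [SEG] position
        prompts = self.proj(seg_hidden)  # (batch, prompt_dim * num_prompts)
        return prompts.view(-1, self.num_prompts, self.prompt_dim)

adapter = SegTokenToPrompt()
seg_hidden = torch.randn(1, 4096)        # stand-in for the LLM output
prompt_embeddings = adapter(seg_hidden)  # (1, 2, 256), fed to the mask decoder
```

On this reading, such an adapter would be the only new glue between the two large components, which is what makes the "minimal instruction tuning" framing plausible.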

If this is right

  • Supports referring segmentation and conversation on both images and videos.
  • Achieves strong performance in referring video object segmentation in complex scenes.
  • Can be extended to other vision-language models, such as Qwen-VL and Intern-VL, which are updated rapidly in the open-source community.
  • Introduces the Ref-SAV dataset, containing over 72k auto-labeled object expressions for training and 2k manually validated objects for benchmarking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Such unification might simplify deployment in applications requiring real-time understanding of dynamic scenes.
  • Minimal tuning could lower the barrier for adapting models to new visual tasks.
  • Testing on even longer video sequences would reveal if the token guidance scales beyond the current benchmarks.

Load-bearing premise

The generated instruction tokens from the language model are sufficient to produce accurate masks from the segmentation model in complex video scenes without additional task-specific training or architecture changes.

What would settle it

A head-to-head evaluation on the 2k manually validated video objects in Ref-SAV: showing that Sa2VA's mask predictions there are less accurate than those of separate specialized referring-video-segmentation models would undercut the unification claim.
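
A self-contained toy of that comparison, using per-frame IoU as a stand-in for the benchmark's actual metric; `predict` and the clip tuples are hypothetical placeholders, not APIs from the paper or its code release.

```python
import numpy as np

def iou(pred, gt):
    """Per-frame IoU between boolean masks; a stand-in for the J&F metric
    used by referring video object segmentation benchmarks."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def mean_accuracy(predict, clips):
    """`predict(video, expression)` returns per-frame masks; `clips` yields
    (video, expression, gt_masks) tuples. Both are hypothetical stand-ins."""
    per_clip = []
    for video, expression, gt_masks in clips:
        pred_masks = predict(video, expression)
        per_clip.append(np.mean([iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
    return float(np.mean(per_clip))

# The test: run the unified model and a specialist baseline over the same
# 2k validated Ref-SAV objects and compare mean accuracy; if the unified
# model loses, the central claim is weakened.
```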

Original abstract

This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL, which can be updated with rapid process in current open-sourced VLMs. Code and models have been provided to the community.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Sa2VA, a unified model that marries SAM-2 with LLaVA (and is extensible to other VLMs such as Qwen-VL and Intern-VL) to enable dense grounded understanding of both images and videos. It claims to support referring segmentation, conversation, and related tasks via a shared LLM token space in which the LLM generates instruction tokens that directly guide SAM-2's mask decoder, all with only minimal one-shot instruction tuning. The work also contributes the Ref-SAV dataset (>72k auto-labeled object expressions plus a 2k manually validated subset) and reports strong empirical results on referring video object segmentation and related benchmarks.

Significance. If the central integration claim holds, the work would be significant as the first unified architecture for dense grounded image/video understanding that avoids task-specific architectural additions or heavy fine-tuning. The introduction of the Ref-SAV dataset and the public release of code and models are concrete community contributions that could accelerate follow-up research on referring video segmentation.

major comments (3)
  1. [§3] §3 (Model Architecture) and Figure 2: the mapping from LLM-generated instruction tokens to SAM-2 prompt-encoder inputs (points, boxes, or masks) is never specified by equation, diagram, or pseudocode. SAM-2's prompt encoder expects explicit spatial prompts; without an explicit conversion step (linear projection, cross-attention, or adapter), the claim that 'minimal one-shot instruction tuning' suffices cannot be evaluated.
  2. [§4.2] §4.2 and Table 2: no ablation isolates the contribution of the token-to-prompt interface versus dataset engineering or implicit adaptation. The reported gains on Ref-SAV and referring video segmentation therefore cannot be attributed to the claimed unification regime.
  3. [§4.3] §4.3 (Failure-mode analysis): the manuscript provides no quantitative or qualitative analysis of error cases involving occlusion, motion blur, or multiple similar objects, which are precisely the regimes where an underspecified token-to-mask pathway would be expected to degrade.
minor comments (2)
  1. [Abstract] Abstract: 'M LLM' should read 'MLLM'; 'Ref-SAV datasets' is inconsistent with the singular usage elsewhere.
  2. [§5] §5 (Related Work): several recent referring-video-segmentation baselines (e.g., post-2023 works) are missing; the comparison table would be strengthened by their inclusion.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us identify areas for improvement. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and analyses.

Point-by-point responses
  1. Referee: [§3] §3 (Model Architecture) and Figure 2: the mapping from LLM-generated instruction tokens to SAM-2 prompt-encoder inputs (points, boxes, or masks) is never specified by equation, diagram, or pseudocode. SAM-2's prompt encoder expects explicit spatial prompts; without an explicit conversion step (linear projection, cross-attention, or adapter), the claim that 'minimal one-shot instruction tuning' suffices cannot be evaluated.

    Authors: We agree that the token-to-prompt mapping requires more explicit specification. In the revised version, we will add a dedicated equation in §3 describing the lightweight adapter (a linear projection followed by a small cross-attention layer) that converts LLM instruction tokens into the spatial prompt embeddings expected by SAM-2's prompt encoder. We will also update Figure 2 with a detailed diagram and pseudocode for this conversion step. This adapter is the only trainable component during the one-shot instruction tuning, which supports our claim of minimal tuning while preserving SAM-2's frozen weights. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2: no ablation isolates the contribution of the token-to-prompt interface versus dataset engineering or implicit adaptation. The reported gains on Ref-SAV and referring video segmentation therefore cannot be attributed to the claimed unification regime.

    Authors: We acknowledge the need for clearer attribution. In the revision, we will add an ablation study in §4.2 that compares (1) the full Sa2VA model, (2) a variant using direct text-to-SAM-2 prompting without the token interface, and (3) training without the Ref-SAV dataset. These results will quantify the isolated contribution of the unified token-to-prompt pathway and help substantiate the unification regime. revision: yes

  3. Referee: [§4.3] §4.3 (Failure-mode analysis): the manuscript provides no quantitative or qualitative analysis of error cases involving occlusion, motion blur, or multiple similar objects, which are precisely the regimes where an underspecified token-to-mask pathway would be expected to degrade.

    Authors: We will expand §4.3 with a dedicated failure-mode analysis subsection. This will include quantitative metrics (e.g., J&F scores broken down by occlusion level and object similarity) on challenging subsets of Ref-SAV and DAVIS, plus qualitative visualizations of error cases involving occlusion, motion blur, and ambiguous objects. The analysis will discuss how the token interface performs in these regimes and note remaining limitations. revision: yes
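
The promised breakdown hinges on the J&F metric, so a small self-contained approximation is sketched here: J is mask IoU, and F matches boundary pixels within a small tolerance band, which is how the standard DAVIS toolkit approximates contour accuracy. The occlusion-level grouping at the end is a hypothetical illustration of the proposed analysis, not code from the paper.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def region_j(pred, gt):
    """Region similarity J: IoU of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def boundary_f(pred, gt, tol=2):
    """Contour accuracy F, approximated by matching boundary pixels within a
    tol-pixel band (a common approximation of the DAVIS F-measure)."""
    def contour(mask):
        return mask & ~binary_erosion(mask)
    pb, gb = contour(pred), contour(gt)
    precision = (pb & binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def j_and_f(pred_frames, gt_frames):
    """Mean of J and F over a clip's per-frame masks."""
    js = [region_j(p, g) for p, g in zip(pred_frames, gt_frames)]
    fs = [boundary_f(p, g) for p, g in zip(pred_frames, gt_frames)]
    return (np.mean(js) + np.mean(fs)) / 2

# Hypothetical breakdown by annotated occlusion level, as the rebuttal proposes:
# scores = {lvl: np.mean([j_and_f(c.preds, c.gts) for c in clips if c.occlusion == lvl])
#           for lvl in ("none", "partial", "heavy")}
```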

Circularity Check

0 steps flagged

No circularity: empirical integration of SAM-2 and MLLM with external benchmarks

Full rationale

The paper presents Sa2VA as an architectural marriage of SAM-2 and an MLLM (LLaVA-style) that unifies modalities into a shared token space and generates instruction tokens to guide mask production. All load-bearing claims are supported by reported experiments on Ref-SAV and other benchmarks rather than by any closed derivation. No equations appear that define outputs in terms of fitted inputs, no self-citation chain justifies a uniqueness theorem or ansatz, and the mapping from LLM states to SAM-2 prompts is treated as an engineering choice validated externally. The claims are therefore grounded in independent evaluation data rather than in self-referential reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard deep-learning training assumptions plus the new integration mechanism and dataset; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: Standard assumptions in deep learning training and optimization hold for the combined model
    Invoked implicitly when claiming that minimal one-shot tuning suffices.
invented entities (1)
  • Ref-SAV dataset (no independent evidence)
    purpose: Auto-labeled training and benchmark data for referring video object segmentation
    New dataset of 72k expressions introduced to support training.

pith-pipeline@v0.9.0 · 5591 in / 1168 out tokens · 24012 ms · 2026-05-16T11:35:57.949381+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    cs.CV 2026-01 unverdicted novelty 8.0

    Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

  2. Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

    cs.CV 2026-05 conditional novelty 7.0

    SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annot...

  3. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

  4. ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

    cs.CV 2026-05 unverdicted novelty 7.0

    ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.

  5. MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

    cs.MM 2026-04 unverdicted novelty 7.0

    MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...

  6. Training a Student Expert via Semi-Supervised Foundation Model Distillation

    cs.CV 2026-04 conditional novelty 7.0

    A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

  7. 3AM: 3egment Anything with Geometric Consistency in Videos

    cs.CV 2026-01 unverdicted novelty 7.0

    3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

  8. Venus-DeFakerOne: Unified Fake Image Detection & Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.

  9. MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    MPerS dynamically mixes semantic guidance from MLLM-generated RS captions with DINOv3 features via MixExperts and Linguistic Query Guided Attention to achieve superior semantic segmentation on three public remote sens...

  10. X2SAM: Any Segmentation in Images and Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.

  11. SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatiO uses heterogeneous vision-language agents with test-time orchestration to dynamically weight their contributions for improved spatial reasoning on benchmarks like 3DSRBench and CV-Bench.

  12. Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.

  13. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  14. Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    The 2026 PVUW Challenge introduces a new audio track and evaluates top multimodal methods on challenging video datasets for pixel-level understanding.

  15. 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA

    cs.CV 2026-04 unverdicted novelty 3.0

    ASR-SaSaSa2VA turns audio into text via ASR then feeds it to pre-trained referring video segmentation models, achieving 80.7 and second place in the 5th PVUW MeViS-v2-Audio track.

  16. APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

    cs.SD 2026-04 unverdicted novelty 3.0

    A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...

  17. AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method

    cs.CV 2026-04 unverdicted novelty 3.0

    An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.

  18. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

125 extracted references · 125 canonical work pages · cited by 18 Pith papers · 19 internal anchors
