Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Pith reviewed 2026-05-16 11:35 UTC · model grok-4.3
The pith
Sa2VA unifies segmentation and language models for referring tasks on both images and videos using minimal instruction tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unifying the video segmentation model and the vision-language model in a shared LLM token space lets instruction tokens generated by the language model guide the production of precise masks for referring segmentation and conversation tasks across images and videos.
What carries the argument
The shared LLM token space in which the language model produces instruction tokens to direct the segmentation model for mask generation.
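The mechanism can be sketched minimally: the LLM emits a special segmentation token among its output tokens, and only that token's hidden state is projected into the segmentation model's prompt space. A hedged illustration follows; the token id, sizes, and projection below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64   # LLM hidden size (illustrative)
PROMPT = 32   # mask-decoder prompt-embedding size (illustrative)

# Hypothetical projection: the only glue between LLM and segmentation model.
W_proj = rng.standard_normal((HIDDEN, PROMPT)) * 0.02

def extract_seg_embedding(hidden_states, token_ids, seg_token_id):
    """Pick the hidden states of generated segmentation tokens and project
    them into the prompt space the mask decoder consumes."""
    positions = np.where(token_ids == seg_token_id)[0]
    seg_hidden = hidden_states[positions]          # (n_seg, HIDDEN)
    return seg_hidden @ W_proj                     # (n_seg, PROMPT)

# Toy forward pass: 5 generated tokens, one of which is the [SEG] token (id 999).
hidden_states = rng.standard_normal((5, HIDDEN))
token_ids = np.array([11, 999, 42, 7, 3])
prompt_emb = extract_seg_embedding(hidden_states, token_ids, seg_token_id=999)
print(prompt_emb.shape)  # (1, 32)
```

The point of the sketch is that the language side and the segmentation side only meet through this one projected embedding, which is what makes "minimal instruction tuning" plausible.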
If this is right
- Supports referring segmentation and conversation on both images and videos.
- Achieves strong performance in referring video object segmentation in complex scenes.
- Can be extended to other vision-language models, such as Qwen-VL and Intern-VL, and so inherits their rapid update cycles.
- Introduces Ref-SAV dataset containing over 72k auto-labeled object expressions for training and 2k validated for benchmarking.
Where Pith is reading between the lines
- Such unification might simplify deployment in applications requiring real-time understanding of dynamic scenes.
- Minimal tuning could lower the barrier for adapting models to new visual tasks.
- Testing on even longer video sequences would reveal if the token guidance scales beyond the current benchmarks.
Load-bearing premise
The generated instruction tokens from the language model are sufficient to produce accurate masks from the segmentation model in complex video scenes without additional task-specific training or architecture changes.
What would settle it
A head-to-head comparison on the 2k manually validated video objects in Ref-SAV: if Sa2VA's mask predictions there are less accurate than those of separate specialized models for referring video segmentation, the load-bearing premise fails.
read the original abstract
This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL, which can be updated with rapid process in current open-sourced VLMs. Code and models have been provided to the community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Sa2VA, a unified model that marries SAM-2 with LLaVA (and extensible to other VLMs such as Qwen-VL and Intern-VL) to enable dense grounded understanding of both images and videos. It claims to support referring segmentation, conversation, and related tasks via a shared LLM token space in which the LLM generates instruction tokens that directly guide SAM-2's mask decoder, all with only minimal one-shot instruction tuning. The work also contributes the Ref-SAV dataset (>72k auto-labeled object expressions plus a 2k manually validated subset) and reports strong empirical results on referring video object segmentation and related benchmarks.
Significance. If the central integration claim holds, the work would be significant as the first unified architecture for dense grounded image/video understanding that avoids task-specific architectural additions or heavy fine-tuning. The introduction of the Ref-SAV dataset and the public release of code and models are concrete community contributions that could accelerate follow-up research on referring video segmentation.
major comments (3)
- [§3] §3 (Model Architecture) and Figure 2: the mapping from LLM-generated instruction tokens to SAM-2 prompt-encoder inputs (points, boxes, or masks) is never specified by equation, diagram, or pseudocode. SAM-2's prompt encoder expects explicit spatial prompts; without an explicit conversion step (linear projection, cross-attention, or adapter), the claim that 'minimal one-shot instruction tuning' suffices cannot be evaluated.
- [§4.2] §4.2 and Table 2: no ablation isolates the contribution of the token-to-prompt interface versus dataset engineering or implicit adaptation. The reported gains on Ref-SAV and referring video segmentation therefore cannot be attributed to the claimed unification regime.
- [§4.3] §4.3 (Failure-mode analysis): the manuscript provides no quantitative or qualitative analysis of error cases involving occlusion, motion blur, or multiple similar objects, which are precisely the regimes where an underspecified token-to-mask pathway would be expected to degrade.
minor comments (2)
- [Abstract] Abstract: 'M LLM' should read 'MLLM'; 'Ref-SAV datasets' is inconsistent with the singular usage elsewhere.
- [§5] §5 (Related Work): several recent referring-video-segmentation baselines (e.g., post-2023 works) are missing; the comparison table would be strengthened by their inclusion.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us identify areas for improvement. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and analyses.
read point-by-point responses
-
Referee: [§3] §3 (Model Architecture) and Figure 2: the mapping from LLM-generated instruction tokens to SAM-2 prompt-encoder inputs (points, boxes, or masks) is never specified by equation, diagram, or pseudocode. SAM-2's prompt encoder expects explicit spatial prompts; without an explicit conversion step (linear projection, cross-attention, or adapter), the claim that 'minimal one-shot instruction tuning' suffices cannot be evaluated.
Authors: We agree that the token-to-prompt mapping requires more explicit specification. In the revised version, we will add a dedicated equation in §3 describing the lightweight adapter (a linear projection followed by a small cross-attention layer) that converts LLM instruction tokens into the spatial prompt embeddings expected by SAM-2's prompt encoder. We will also update Figure 2 with a detailed diagram and pseudocode for this conversion step. This adapter is the only trainable component during the one-shot instruction tuning, which supports our claim of minimal tuning while preserving SAM-2's frozen weights. revision: yes
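The adapter the rebuttal describes — a linear projection followed by a small cross-attention layer mapping a variable number of instruction tokens to a fixed set of prompt embeddings — can be sketched as follows. All sizes, names, and the residual form are assumptions for illustration, not the authors' promised equation.

```python
import numpy as np

rng = np.random.default_rng(1)
D_LLM, D_SAM, N_SLOTS = 64, 32, 4  # illustrative dimensions

# Hypothetical adapter parameters (the only trainable pieces in this sketch;
# the LLM and the segmentation model would both stay frozen).
W_in = rng.standard_normal((D_LLM, D_SAM)) * 0.02     # linear projection
slots = rng.standard_normal((N_SLOTS, D_SAM)) * 0.02  # learned prompt slots
W_q = rng.standard_normal((D_SAM, D_SAM)) * 0.02
W_k = rng.standard_normal((D_SAM, D_SAM)) * 0.02
W_v = rng.standard_normal((D_SAM, D_SAM)) * 0.02

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adapt(instruction_tokens):
    """Project LLM instruction tokens, then let learned slots cross-attend
    to them, yielding a fixed-size set of prompt embeddings."""
    kv = instruction_tokens @ W_in                   # (n_tok, D_SAM)
    q, k, v = slots @ W_q, kv @ W_k, kv @ W_v
    attn = softmax(q @ k.T / np.sqrt(D_SAM))         # (N_SLOTS, n_tok)
    return slots + attn @ v                          # residual, (N_SLOTS, D_SAM)

prompts = adapt(rng.standard_normal((3, D_LLM)))     # 3 instruction tokens in
print(prompts.shape)  # (4, 32): fixed number of prompt embeddings out
```

The fixed slot count is the design choice doing the work here: it decouples however many tokens the LLM generates from the fixed prompt interface the mask decoder expects.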
-
Referee: [§4.2] §4.2 and Table 2: no ablation isolates the contribution of the token-to-prompt interface versus dataset engineering or implicit adaptation. The reported gains on Ref-SAV and referring video segmentation therefore cannot be attributed to the claimed unification regime.
Authors: We acknowledge the need for clearer attribution. In the revision, we will add an ablation study in §4.2 that compares (1) the full Sa2VA model, (2) a variant using direct text-to-SAM-2 prompting without the token interface, and (3) training without the Ref-SAV dataset. These results will quantify the isolated contribution of the unified token-to-prompt pathway and help substantiate the unification regime. revision: yes
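The attribution logic of the proposed ablation is simple subtraction over matched variants. The scores below are made-up placeholders, not results from the paper; only the bookkeeping is the point.

```python
# Hypothetical J&F scores for the three ablation variants the rebuttal
# promises; the numbers are illustrative, not reported results.
jf_scores = {
    "full_model": 55.0,
    "no_token_interface": 48.0,  # direct text prompting, no token-to-prompt path
    "no_ref_sav_data": 51.0,     # trained without the Ref-SAV dataset
}

def contribution(full_key, ablated_key, scores):
    """Marginal J&F contribution of the component removed in `ablated_key`."""
    return scores[full_key] - scores[ablated_key]

print(contribution("full_model", "no_token_interface", jf_scores))  # 7.0
print(contribution("full_model", "no_ref_sav_data", jf_scores))     # 4.0
```

Only if the first delta stays large when the second is controlled for can the gains be credited to the unification regime rather than to dataset engineering.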
-
Referee: [§4.3] §4.3 (Failure-mode analysis): the manuscript provides no quantitative or qualitative analysis of error cases involving occlusion, motion blur, or multiple similar objects, which are precisely the regimes where an underspecified token-to-mask pathway would be expected to degrade.
Authors: We will expand §4.3 with a dedicated failure-mode analysis subsection. This will include quantitative metrics (e.g., J&F scores broken down by occlusion level and object similarity) on challenging subsets of Ref-SAV and DAVIS, plus qualitative visualizations of error cases involving occlusion, motion blur, and ambiguous objects. The analysis will discuss how the token interface performs in these regimes and note remaining limitations. revision: yes
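The J&F metric the rebuttal would break down combines region similarity J (mask IoU) with a boundary F-measure. A minimal sketch, with a crude boundary term (the DAVIS protocol additionally dilates boundaries before matching, which this omits):

```python
import numpy as np

def region_j(pred, gt):
    """Region similarity J: intersection-over-union of binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary_f(pred, gt):
    """Simplified boundary F-measure: F1 between boundary-pixel sets."""
    def boundary(m):
        # a pixel is boundary if it is foreground with a background 4-neighbour
        padded = np.pad(m, 1)
        interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                    & padded[1:-1, :-2] & padded[1:-1, 2:])
        return m & ~interior
    pb, gb = boundary(pred), boundary(gt)
    tp = (pb & gb).sum()
    prec = tp / pb.sum() if pb.sum() else 1.0
    rec = tp / gb.sum() if gb.sum() else 1.0
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

# Toy masks: prediction misses one column of the ground-truth object.
pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), bool); gt[2:6, 2:7] = True
jf = (region_j(pred, gt) + boundary_f(pred, gt)) / 2
print(round(region_j(pred, gt), 3))  # 0.8
```

Breaking such scores down by occlusion level or object similarity then amounts to computing this per subset rather than over the whole benchmark.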
Circularity Check
No circularity: empirical integration of SAM-2 and MLLM with external benchmarks
full rationale
The paper presents Sa2VA as an architectural marriage of SAM-2 and an MLLM (LLaVA-style) that unifies modalities into a shared token space and generates instruction tokens to guide mask production. All load-bearing claims are supported by reported experiments on Ref-SAV and other benchmarks rather than any closed derivation. No equations appear that define outputs in terms of fitted inputs, no self-citation chain justifies a uniqueness theorem or ansatz, and the mapping from LLM states to SAM-2 prompts is treated as an engineering choice validated externally. The work is therefore self-contained against independent evaluation data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: standard assumptions in deep learning training and optimization hold for the combined model
invented entities (1)
-
Ref-SAV dataset
no independent evidence
Forward citations
Cited by 18 Pith papers
-
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
-
Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annot...
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
-
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...
-
Training a Student Expert via Semi-Supervised Foundation Model Distillation
A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
-
3AM: 3egment Anything with Geometric Consistency in Videos
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
-
Venus-DeFakerOne: Unified Fake Image Detection & Localization
DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.
-
MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation
MPerS dynamically mixes semantic guidance from MLLM-generated RS captions with DINOv3 features via MixExperts and Linguistic Query Guided Attention to achieve superior semantic segmentation on three public remote sens...
-
X2SAM: Any Segmentation in Images and Videos
X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
-
SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
SpatiO uses heterogeneous vision-language agents with test-time orchestration to dynamically weight their contributions for improved spatial reasoning on benchmarks like 3DSRBench and CV-Bench.
-
Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models
ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.
-
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.
-
Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding
The 2026 PVUW Challenge introduces a new audio track and evaluates top multimodal methods on challenging video datasets for pixel-level understanding.
-
2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA
ASR-SaSaSa2VA turns audio into text via ASR then feeds it to pre-trained referring video segmentation models, achieving 80.7 and second place in the 5th PVUW MeViS-v2-Audio track.
-
APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...
-
AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method
An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
Reference graph
Works this paper leans on
-
[1]
Vqa: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015
work page 2015
-
[2]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
One token to seg them all: Language instructed reasoning segmentation in videos
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. NeurIPS, 2024
work page 2024
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020
work page 2020
-
[6]
Making large multimodal models understand arbitrary visual prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Making large multimodal models understand arbitrary visual prompts. In CVPR, 2024
work page 2024
-
[7]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020
work page 2020
-
[8]
Pix2video: Video editing using image diffusion
Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In ICCV, 2023
work page 2023
-
[9]
Revisiting referring expression comprehension evaluation in the era of large multimodal models
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. arXiv preprint arXiv:2406.16866, 2024
-
[10]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing
Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, and Chunyuan Li. Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571, 2023
-
[12]
Deco: Unleashing the potential of convnets for query-based detection and segmentation
Xinghao Chen, Siwei Li, Yijing Yang, and Yunhe Wang. Deco: Unleashing the potential of convnets for query-based detection and segmentation. In ICLR, 2025
work page 2025
-
[13]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Masked-attention mask transformer for universal image segmentation
Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022
work page 2022
-
[16]
Putting the object back into video object segmentation
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In CVPR, 2024
work page 2024
-
[17]
Xtuner: A toolkit for efficiently fine-tuning llm
XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023
work page 2023
-
[18]
Mevis: A large-scale benchmark for video segmentation with motion expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. In ICCV, 2023
work page 2023
-
[19]
Pvuw 2025 challenge report: Advances in pixel-level understanding of complex videos in the wild
Henghui Ding, Chang Liu, Nikhila Ravi, Shuting He, Yunchao Wei, Song Bai, and Philip Torr. Pvuw 2025 challenge report: Advances in pixel-level understanding of complex videos in the wild. In CVPR workshop, 2025
work page 2025
-
[20]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehens...
-
[21]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2-4khd: A pioneering large vision-language mode...
work page 2024
-
[22]
Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024
work page 2024
-
[23]
Mmbench-video: A long-form multi-shot benchmark for holistic video understanding
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515, 2024
-
[24]
Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing
Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. In NeurIPS, 2024
work page 2024
-
[25]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Video object segmentation-based visual servo control and object depth estimation on a mobile robot
Brent Griffin, Victoria Florence, and Jason Corso. Video object segmentation-based visual servo control and object depth estimation on a mobile robot. In WACV, 2020
work page 2020
-
[28]
Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data. arXiv preprint arXiv:2410.18558, 2024
-
[29]
Tianrui Guan, Divya Kothandaraman, Rohan Chandra, Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, and Dinesh Manocha. Ga-nav: Efficient terrain segmentation for robot navigation in unstructured outdoor environments. RA-L, 2022
work page 2022
-
[30]
Tianrui Guan, Ruitao Song, Zhixian Ye, et al., 2023
work page 2023
-
[31]
Openvis: Open-vocabulary video instance segmentation
Pinxue Guo, Tony Huang, Peiyang He, Xuefeng Liu, Tianjun Xiao, Zhaoyu Chen, and Wenqiang Zhang. Openvis: Open-vocabulary video instance segmentation. arXiv preprint arXiv:2305.16835, 2023
-
[32]
Free video-llm: Prompt-guided visual perception for efficient training-free video llms
Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, and Yunhe Wang. Free video-llm: Prompt-guided visual perception for efficient training-free video llms. arXiv preprint arXiv:2410.10441, 2024
-
[33]
Animate-a-story: Storytelling with retrieval-augmented video generation
Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint arXiv:2307.06940, 2023
-
[34]
Vtimellm: Empower llm to grasp video moments
Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In CVPR, 2024
work page 2024
-
[35]
Reason3d: Searching and reasoning 3d segmentation via large language model
Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, and Ming-Hsuan Yang. Reason3d: Searching and reasoning 3d segmentation via large language model. In 3DV, 2025
work page 2025
-
[36]
Style-a-video: Agile diffusion for arbitrary text-based video style transfer
Nisha Huang, Yuxin Zhang, and Weiming Dong. Style-a-video: Agile diffusion for arbitrary text-based video style transfer. SPL, 2024
work page 2024
-
[37]
Pixel-bert: Aligning image pixels with text by deep multi-modal transformers
Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020
-
[38]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019
work page 2019
-
[39]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Video instance segmentation using inter-frame communication transformers
Sukjun Hwang, Miran Heo, Seoung Wug Oh, and Seon Joo Kim. Video instance segmentation using inter-frame communication transformers. In NeurIPS, 2021
work page 2021
-
[41]
Memory-space visual prompting for efficient vision-language fine-tuning
Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, and Yunhe Wang. Memory-space visual prompting for efficient vision-language fine-tuning. arXiv preprint arXiv:2405.05615, 2024
-
[42]
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014
work page 2014
-
[43]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016
work page 2016
-
[44]
Video object segmentation with language referring expressions
Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In ACCV, 2018
work page 2018
-
[45]
Videopoet: A large language model for zero-shot video generation
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. In ICML, 2024
work page 2024
-
[46]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In CVPR, 2024
work page 2024
-
[47]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
Seed-bench: Benchmarking multimodal large language models
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In CVPR, 2024
work page 2024
-
[49]
Aria: An open multimodal native mixture-of-experts model
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024
-
[50]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022
work page 2022
-
[51]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023
work page 2023
-
[52]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Univs: Unified and universal video segmentation with prompts as queries
Minghan Li, Shuai Li, Xindong Zhang, and Lei Zhang. Univs: Unified and universal video segmentation with prompts as queries. In CVPR, 2024
work page 2024
-
[54]
Tube-link: A flexible cross tube baseline for universal video segmentation
Xiangtai Li, Haobo Yuan, Wenwei Zhang, Guangliang Cheng, Jiangmiao Pang, and Chen Change Loy. Tube-link: A flexible cross tube baseline for universal video segmentation. In ICCV, 2023
work page 2023
-
[55]
Omg-seg: Is one model good enough for all segmentation?
Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, and Chen Change Loy. Omg-seg: Is one model good enough for all segmentation? In CVPR, 2024
work page 2024
-
[56]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, 2024
work page 2024
-
[57]
Evaluating object hallucination in large vision-language models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023
work page 2023
-
[58]
Video-llava: Learning united visual representation by alignment before projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In EMNLP, 2024
work page 2024
-
[59]
GRES: Generalized referring expression segmentation
Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In CVPR, 2023
work page 2023
-
[60]
Chang Liu, Henghui Ding, Kaining Ying, Lingyi Hong, Ning Xu, Linjie Yang, Yuchen Fan, Mingqi Gao, Jingkun Chen, Yunqi Miao, et al. Lsvos 2025 challenge report: Recent advances in complex video object segmentation. ICCV workshop, 2025.
-
[61]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
-
[62]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024.
-
[63]
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
-
[64]
Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. In CVPR, 2024.
-
[65]
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. In ECCV, 2024.
-
[66]
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024.
-
[67]
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.
-
[68]
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In ACL, 2024.
-
[69]
Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, and Fahad Khan. Pg-video-llava: Pixel grounding large video-language models. arXiv preprint arXiv:2311.13435, 2023.
-
[70]
Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos. arXiv preprint arXiv:2411.04923, 2024.
-
[71]
Zhenliang Ni, Xinghao Chen, Yingjie Zhai, Yehui Tang, and Yunhe Wang. Context-guided spatial feature reconstruction for efficient semantic segmentation. In ECCV, 2024.
-
[72]
Prashant W Patil, Akshay Dudhane, Ashutosh Kulkarni, Subrahmanyam Murala, Anil Balaji Gonde, and Sunil Gupta. An unified recurrent video object segmentation framework for various surveillance environments. IEEE TIP, 2021.
-
[73]
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
-
[74]
Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, and Ming-Hsuan Yang. Generalizable entity grounding via assistance of large language model. arXiv preprint arXiv:2402.02555, 2024.
-
[75]
Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, and Yunhe Wang. Eve: Efficient multimodal vision language models with elastic visual experts. arXiv preprint arXiv:2501.04322, 2025.
-
[76]
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. Glamm: Pixel grounding large multimodal model. In CVPR, 2024.
-
[77]
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
-
[78]
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In CVPR, 2024.
-
[79]
Seonguk Seo, Joon-Young Lee, and Bohyung Han. URVOS: Unified referring video object segmentation network with a large-scale benchmark. In ECCV, 2020.
-
[80]
Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. Prompting large language models with answer heuristics for knowledge-based visual question answering. In CVPR, 2023.