
arxiv: 2605.18018 · v1 · pith:SX3Y23HVnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI · cs.HC

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Pith reviewed 2026-05-20 12:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC
keywords fine-grained object understanding · vision-language alignment · cross-attention maps · referring expressions · multimodal large language models · video understanding · spatial consistency

The pith

A training strategy corrects diffuse cross-attention on object nouns so text prompts alone specify precise video objects at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal models show a consistent mismatch: attribute words trigger sharp visual activations, but object nouns produce scattered ones because of semantic reference bias. It introduces the NL-Refer dataset, which pairs masks with natural-language descriptions, and applies SWIM to extract multi-layer cross-attention maps from nouns and then enforce their agreement with ground-truth masks. This supervision occurs only in training, so the model learns to focus on the described object from text without any visual prompts such as masks or points during use. The result is stronger alignment and higher accuracy on fine-grained object understanding benchmarks than methods that still require explicit visual guidance at test time.

Core claim

SWIM extracts cross-attention maps from object nouns across layers and enforces spatial consistency with ground-truth masks on the NL-Refer dataset during training; this corrects the diffuse patterns caused by semantic reference bias and lets the model automatically attend to the user-specified object from textual prompts alone at inference.

What carries the argument

Multi-layer cross-attention maps from object nouns whose spatial consistency is enforced against ground-truth masks using the NL-Refer dataset.
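To make the mechanism concrete, here is a minimal sketch of one way such a consistency term could look. It is not the paper's loss: the function names, the mean fusion across layers, and the "negative log of in-mask attention mass" objective are all assumptions chosen for illustration, and attention maps are represented as flat lists over visual tokens.

```python
import math

def fuse_layers(layer_maps):
    """Average per-layer attention maps (mean fusion is an assumption here)."""
    n_layers = len(layer_maps)
    n_tokens = len(layer_maps[0])
    return [sum(m[i] for m in layer_maps) / n_layers for i in range(n_tokens)]

def consistency_loss(layer_maps, mask, eps=1e-8):
    """Hypothetical loss: negative log of the attention mass inside the mask."""
    fused = fuse_layers(layer_maps)
    total = sum(fused) + eps
    inside = sum(a for a, m in zip(fused, mask) if m)
    return -math.log(inside / total + eps)

# Toy 2x2 frame flattened to 4 visual tokens; the object occupies token 0.
mask = [1, 0, 0, 0]
diffuse = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
sharp = [[0.9, 0.05, 0.03, 0.02], [0.8, 0.1, 0.05, 0.05]]

# A diffuse noun-attention pattern is penalized more than a localized one.
print(consistency_loss(diffuse, mask) > consistency_loss(sharp, mask))
```

In a sketch like this the term would only be added to the training objective; at inference no mask is passed in, which is the property the paper's claim depends on.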

If this is right

  • Models perform fine-grained object understanding in video using only textual prompts at inference.
  • Performance exceeds that of visual-prompt-based methods on the same benchmarks.
  • Text-visual alignment improves without changing the underlying model architecture.
  • Annotation effort for masks or points is needed only during training, not deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency enforcement could be tested on static images to check whether the noun-attention issue is video-specific.
  • The attention-pattern analysis might guide pretraining objectives that reduce reference bias more broadly.
  • Referring expressions in the dataset could be generated automatically in follow-up work to scale the method.

Load-bearing premise

The assumption that fixing the observed diffuse attention on nouns through mask supervision in training will automatically produce correct object focus from text prompts without masks at inference time.

What would settle it

After SWIM training, cross-attention maps for object nouns remain diffuse and performance on fine-grained benchmarks shows no gain when visual prompts are removed at test time.
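One way to turn "remain diffuse" into a number is sketched below. The normalized-entropy measure and the function name are assumptions made for illustration; the paper may quantify diffuseness differently or not at all.

```python
import math

def normalized_entropy(attn):
    """Entropy of an attention map scaled to [0, 1].

    Values near 1 indicate a diffuse (near-uniform) map; values near 0
    indicate a sharply localized map. Assumes len(attn) > 1 and sum(attn) > 0.
    """
    total = sum(attn)
    probs = [a / total for a in attn]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(attn))

# Toy maps over 4 visual tokens.
print(normalized_entropy([1, 1, 1, 1]))  # uniform attention, maximally diffuse
print(normalized_entropy([1, 0, 0, 0]))  # all attention on a single token
```

Comparing this score for object nouns before and after training, on held-out clips with no visual prompt, would be one simple check of the falsification condition above.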

Figures

Figures reproduced from arXiv: 2605.18018 by Bowen Yin, Boyuan Sun, Qibin Hou, Xihan Wei, Yuanming Li.

Figure 1: Pattern comparison between classical fine-grained model …
Figure 2: Visual comparisons of cross-attention maps for object nouns and attribute words between Qwen2.5-VL …
Figure 3: Training pipeline of SWIM. Explicit supervision is applied to the cross-attention between object nouns and visual tokens, enabling accurate fine-grained object grounding from pure neutral text prompts at inference without any extra visual prompt.
Figure 4: Scalability of SWIM. The performance of SWIM scales consistently with the increase in data scale.
Figure 5: Qualitative comparisons between SWIM and Qwen2.5-VL …
Figure 6: Quantitative comparison of fine-grained text–visual …
Original abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SWIM, a training strategy to align vision and language representations in multimodal large language models for fine-grained object understanding. It identifies a systematic discrepancy in cross-attention patterns (sharp activations for attributes, diffuse for object nouns), constructs the NL-Refer dataset pairing object masks with precise natural language referring expressions, and applies a multi-layer consistency loss that enforces spatial agreement between noun-derived cross-attention maps and ground-truth masks during training. The goal is to enable the model to attend correctly to user-specified objects from text prompts alone at inference, without visual prompts such as masks or points, while claiming superior performance over visual-prompt-based methods on fine-grained benchmarks.

Significance. If the central mechanism is validated, SWIM could meaningfully improve the usability of MLLMs for fine-grained video object understanding by removing the requirement for explicit visual inputs at test time. The cross-attention discrepancy analysis and the NL-Refer dataset constitute useful contributions that may aid future alignment research. The significance is tempered by the need for stronger evidence that performance gains arise specifically from the learned spatial consistency rather than dataset enrichment or general fine-tuning.

major comments (3)
  1. [Method (§3) and Experiments (§5)] The core assumption that multi-layer spatial consistency supervision on NL-Refer will cause noun-based cross-attention to become localized and correct at mask-free inference is load-bearing yet under-supported. No ablation isolating the consistency loss from the enriched referring expressions is described, nor is there verification (e.g., attention-map comparisons or quantitative localization metrics) that the behavior persists when the mask signal is removed at test time.
  2. [§5] §5 (Experimental results): The claim of substantial improvement in text-visual alignment and superior benchmark performance requires explicit controls. A baseline that fine-tunes on NL-Refer without the consistency term, together with before/after attention visualizations on held-out examples, is needed to attribute gains to the alignment mechanism rather than dataset curation.
  3. [§5] Table or figure in §5: If attention-map results are presented, they should report quantitative measures (e.g., IoU between noun attention and ground-truth masks) on a held-out test set both with and without the mask signal at inference; qualitative examples alone are insufficient to confirm the transfer.
minor comments (2)
  1. [Title and §1] The title specifies 'Video' fine-grained object understanding, yet the abstract and method description do not clarify whether the approach is applied to video sequences (with temporal modeling) or to individual frames; this should be stated explicitly in §1 and §3.
  2. [Abstract] The abstract states that code and data are available at the GitHub link; confirm that the released repository includes the exact NL-Refer construction scripts and the multi-layer consistency loss implementation to support reproducibility.
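The localization metric requested in major comment 3 could be computed along the following lines. This is a sketch under stated assumptions: the relative-threshold binarization, the 0.5 default, and the function names are illustrative choices, not something taken from the paper or its repository.

```python
def attention_iou(attn, mask, rel_threshold=0.5):
    """IoU between a thresholded attention map and a binary mask.

    A token counts as predicted when its attention is at least
    rel_threshold times the peak value (assumes the peak is positive).
    """
    peak = max(attn)
    pred = [a >= rel_threshold * peak for a in attn]
    inter = sum(1 for p, m in zip(pred, mask) if p and m)
    union = sum(1 for p, m in zip(pred, mask) if p or m)
    return inter / union if union else 0.0

def mean_iou(pairs, rel_threshold=0.5):
    """Average IoU over (attention_map, mask) pairs from a held-out set."""
    scores = [attention_iou(a, m, rel_threshold) for a, m in pairs]
    return sum(scores) / len(scores)

# Toy held-out set over 4 visual tokens: one localized map, one diffuse map.
held_out = [
    ([0.9, 0.8, 0.1, 0.0], [1, 1, 0, 0]),
    ([0.3, 0.3, 0.3, 0.3], [1, 0, 0, 0]),
]
print(mean_iou(held_out))
```

Reporting this score for the pretrained model, for NL-Refer fine-tuning without the consistency term, and for full SWIM, all at mask-free inference, would address the attribution question raised in major comments 1 and 2.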

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We agree that additional ablations and quantitative evaluations are necessary to strengthen the claims regarding the effectiveness of the spatial consistency loss. Below, we provide point-by-point responses to the major comments and describe the revisions we intend to make.

  1. Referee: [Method (§3) and Experiments (§5)] The core assumption that multi-layer spatial consistency supervision on NL-Refer will cause noun-based cross-attention to become localized and correct at mask-free inference is load-bearing yet under-supported. No ablation isolating the consistency loss from the enriched referring expressions is described, nor is there verification (e.g., attention-map comparisons or quantitative localization metrics) that the behavior persists when the mask signal is removed at test time.

    Authors: We acknowledge that the current manuscript lacks an explicit ablation to isolate the contribution of the consistency loss from the dataset itself. In the revised version, we will add an ablation study that fine-tunes the model on NL-Refer both with and without the multi-layer consistency term. Furthermore, we will include attention-map comparisons and quantitative metrics (such as mean IoU) on held-out examples to verify that the localized attention behavior transfers to mask-free inference. This will help attribute the performance gains specifically to the alignment mechanism.
    revision: yes

  2. Referee: [§5] §5 (Experimental results): The claim of substantial improvement in text-visual alignment and superior benchmark performance requires explicit controls. A baseline that fine-tunes on NL-Refer without the consistency term, together with before/after attention visualizations on held-out examples, is needed to attribute gains to the alignment mechanism rather than dataset curation.

    Authors: We agree with the need for explicit controls to isolate the effect of the consistency loss. We will incorporate a baseline experiment fine-tuning on NL-Refer without the consistency term and compare it to the full SWIM approach. Additionally, we will add before-and-after attention visualizations on held-out test examples to illustrate the changes in cross-attention patterns induced by the consistency supervision.
    revision: yes

  3. Referee: [§5] Table or figure in §5: If attention-map results are presented, they should report quantitative measures (e.g., IoU between noun attention and ground-truth masks) on a held-out test set both with and without the mask signal at inference; qualitative examples alone are insufficient to confirm the transfer.

    Authors: We recognize that qualitative examples alone may not suffice to confirm the transfer of localized attention. In the revised manuscript, we will augment the attention-map results with quantitative measures, specifically reporting IoU scores between the noun-derived cross-attention maps and ground-truth masks on a held-out test set. These metrics will be provided for both scenarios: with the mask signal during training (as in the current setup) and at inference without any mask input, to demonstrate the persistence of the alignment.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical training procedure


The paper presents SWIM as an empirical training strategy that applies mask-based spatial consistency supervision only during training on the newly constructed NL-Refer dataset to align cross-attention maps extracted from object nouns. The central claims rest on experimental benchmark results rather than any closed-form derivation, equation, or fitted parameter that reduces the reported improvement to its own inputs by construction. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises, and the method remains externally verifiable through standard train-with-supervision / test-without-supervision protocols. This is a standard supervised fine-tuning setup whose performance claims are independent of the inputs they are measured against.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that cross-attention misalignment for object nouns is correctable via mask-guided training and that this correction generalizes to inference without visual prompts. No free parameters or invented entities are evident from the abstract.

axioms (1)
  • domain assumption: Cross-attention maps extracted from object nouns in pretrained MLLMs can be made spatially consistent with ground-truth object masks through supervised training.
    Invoked when the paper describes extracting multi-layer cross-attention maps and enforcing spatial consistency with masks.

pith-pipeline@v0.9.0 · 5759 in / 1342 out tokens · 45684 ms · 2026-05-20T12:06:44.008066+00:00 · methodology



Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 30 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 1

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

  3. [3]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 14

  4. [4]

    Mak- ing large multimodal models understand arbitrary visual prompts

    Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Mak- ing large multimodal models understand arbitrary visual prompts. InCVPR, pages 12914–12923, 2024. 3

  5. [5]

    Vip- llava: Making large multimodal models understand arbi- trary visual prompts

    Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Vip- llava: Making large multimodal models understand arbi- trary visual prompts. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 12914–12923, 2024. 3

  6. [6]

    Position-enhanced visual instruction tuning for multimodal large language models

    Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, and Yang Liu. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437, 2023. 3

  7. [7]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023. 1

  8. [8]

    Sharegpt4video: Improving video understanding and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions.arXiv preprint arXiv:2406.04325, 2024. 2

  9. [9]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 5

  10. [10]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 5

  11. [11]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites.arXiv preprint arXiv:2404.16821, 2024. 3

  12. [12]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advanc- ing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. 5

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 2

  14. [14]

    Mevis: A large-scale benchmark for video segmentation with motion expressions

    Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceed- ings of the IEEE/CVF international conference on com- puter vision, pages 2694–2703, 2023. 5

  15. [15]

    Mevis: A multi-modal dataset for referring motion expres- sion video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. Mevis: A multi-modal dataset for referring motion expres- sion video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 5

  16. [16]

    Cross-attention transformer-based visual-language fusion for multimodal image analysis.International Jour- nal of Applied Science, 8(1):p27–p27, 2025

    Liwei Ding, Kowei Shih, Hairu Wen, Xinshi Li, and Qin Yang. Cross-attention transformer-based visual-language fusion for multimodal image analysis.International Jour- nal of Applied Science, 8(1):p27–p27, 2025. 2

  17. [17]

    Docopilot: Improving multimodal models for document-level understanding

    Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shen- glong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, et al. Docopilot: Improving multimodal models for document-level understanding. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 4026–4037, 2025. 3

  18. [18]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

  19. [19]

    Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing

    Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. In NeurIPS, 2024. 3

  20. [20]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 5, 6, 14

  21. [21]

    Mme-survey: A comprehensive survey on evaluation of multimodal llms

    Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, et al. Mme-survey: A comprehensive sur- vey on evaluation of multimodal llms.arXiv preprint arXiv:2411.15296, 2024. 2

  22. [22]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3

  23. [23]

    Regiongpt: Towards region understanding vision lan- guage model

    Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. Regiongpt: Towards region understanding vision lan- guage model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13796– 13806, 2024. 1, 3

  24. [24]

    Omni-rgpt: Unifying image and video region-level understanding via token marks.arXiv preprint arXiv:2501.08326, 2025

    Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Sub- hashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, and Ryo Hachiuma. Omni-rgpt: Unifying image and video region-level understanding via token marks.arXiv preprint arXiv:2501.08326, 2025. 3

  25. [25]

    Flowsearch: Advancing deep research with dynamic structured knowledge flow.arXiv preprint arXiv:2510.08521, 2025

    Yusong Hu, Runmin Ma, Yue Fan, Jinxin Shi, Zongsheng Cao, Yuhao Zhou, Jiakang Yuan, Xiangchao Yan, Wenlong Zhang, Lei Bai, et al. Flowsearch: Advancing deep research with dynamic structured knowledge flow.arXiv preprint arXiv:2510.08521, 2025. 3

  26. [26]

    Segment and caption anything

    Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, and Zicheng Liu. Segment and caption anything. InCVPR, pages 13405– 13417, 2024. 3

  27. [27]

    Attention as grounding: Exploring textual and cross-modal attention on entities and relations in language-and-vision transformer

    Nikolai Ilinykh and Simon Dobnik. Attention as grounding: Exploring textual and cross-modal attention on entities and relations in language-and-vision transformer. InFindings of the association for computational linguistics: ACL 2022, pages 4062–4073, 2022. 2

  28. [28]

    Referring to any person

    Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Liu Qin, and Lei Zhang. Referring to any person. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 21667– 21678, 2025. 1

  29. [29]

    Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens

    Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 25004–25014, 2025. 3

  30. [30]

    Geoagent: Learning to geolocate everywhere with reinforced geographic char- acteristics.arXiv preprint arXiv:2602.12617, 2026

    Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, Ming-Ming Cheng, and Qibin Hou. Geoagent: Learning to geolocate everywhere with reinforced geographic char- acteristics.arXiv preprint arXiv:2602.12617, 2026. 3

  31. [31]

    What’s in the image? a deep-dive into the vision of vision language mod- els

    Omri Kaduri, Shai Bagon, and Tali Dekel. What’s in the image? a deep-dive into the vision of vision language mod- els. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14549–14558, 2025. 3

  32. [32]

    Your large vision-language model only needs a few attention heads for visual grounding

    Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. Your large vision-language model only needs a few attention heads for visual grounding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9339–9350, 2025. 3

  33. [33]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5

  34. [34]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wen- hai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 5

  35. [35]

    Mvbench: A comprehensive multi-modal video under- standing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video under- standing benchmark. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 5, 6, 14

  36. [36]

    Object attribute matters in visual question answering

    Peize Li, Qingyi Si, Peng Fu, Zheng Lin, and Yan Wang. Object attribute matters in visual question answering. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 18545–18553, 2024. 2

  37. [37]

    Tgif: A new dataset and benchmark on animated gif description

    Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016. 1

  38. [38]

    Tempsamp-r1: Effective temporal sampling with reinforcement fine-tuning for video llms.arXiv preprint arXiv:2509.18056, 2025

    Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, and Ming-Ming Cheng. Tempsamp-r1: Effective temporal sampling with rein- forcement fine-tuning for video llms.arXiv preprint arXiv:2509.18056, 2025. 1

  39. [39]

    Describe anything: Detailed localized image and video captioning.ArXiv, abs/2504.16072, 2025

    Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Dar- rell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025. 5, 6

  40. [40]

    Vila: On pre-training for vi- sual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. InCVPR, pages 26689–26699, 2024. 3

  41. [41]

    Perceive anything: Recog- nize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025

    Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recog- nize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025. 1, 5

  42. [42]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 3

  43. [43]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023. 2

  44. [44]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024. 3

  45. [45]

    Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial- temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024. 1

  46. [46]

    Large Language Models: A Survey

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jian- feng Gao. Large language models: A survey.arXiv preprint arXiv:2402.06196, 2024. 3

  47. [47]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video- based large language models.Computational Visual Media,

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video- based large language models.Computational Visual Media,

  48. [48]

    ChatGPT

    OpenAI. ChatGPT, 2023. 1

  49. [49]

    GPT-4o system card

    OpenAI. Gpt-4o system card, 2024. 2, 5, 6

  50. [50]

    Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning

    Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning. arXiv preprint arXiv:2412.03565, 2024. 1, 2, 5

  51. [51]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017. 5

  52. [52]

    Beyond semantics: Rediscovering spatial awareness in vision-language models

    Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Beyond semantics: Rediscovering spatial awareness in vision-language models. arXiv preprint arXiv:2503.17349, 2025. 2

  53. [53]

    Artemis: Towards referential understanding in complex videos

    Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in complex videos. Advances in Neural Information Processing Systems, 37:114321–114347, 2024. 5

  54. [54]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2

  55. [55]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In CVPR, pages 13009–13018, 2024. 3

  56. [56]

    Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment

    Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, and Gal Chechik. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Advances in Neural Information Processing Systems, 36:3536–3559, 2023. 2

  57. [57]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv:2410.174...

  58. [58]

    Depth anything at any condition

    Boyuan Sun, Modi Jin, Bowen Yin, and Qibin Hou. Depth anything at any condition. arXiv preprint arXiv:2507.01634, 2025. 3

  59. [59]

    Llava-scissor: Token compression with semantic connected components for video llms

    Boyuan Sun, Jiaxing Zhao, Xihan Wei, and Qibin Hou. Llava-scissor: Token compression with semantic connected components for video llms. arXiv preprint arXiv:2506.21862, 2025. 3

  60. [60]

    Video understanding with large language models: A survey

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems ...

  61. [61]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. 3

  62. [62]

    Qwen2-vl

    Qwen team. Qwen2-vl. 2024. 1, 5

  63. [63]

    Qwen2.5: A party of foundation models

    Qwen Team. Qwen2.5: A party of foundation models,

  64. [64]

    Chatterbox: Multi-round multimodal referring and grounding

    Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. Chatterbox: Multi-round multimodal referring and grounding. arXiv preprint arXiv:2401.13307, 2024. 3

  65. [65]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2024. 3

  66. [66]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 1

  67. [67]

    Elysium: Exploring object-level perception in videos via mllm

    Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In European Conference on Computer Vision, pages 166–185. Springer, 2024. 5

  68. [68]

    Reconstructive visual instruction tuning

    Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive visual instruction tuning. arXiv preprint arXiv:2410.09575, 2024. 3

  69. [69]

    X-sam: From segment anything to any segmentation

    Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. arXiv preprint arXiv:2508.04655, 2025. 3

  70. [70]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 1

  71. [71]

    Aligning large language models with human: A survey

    Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023. 3

  72. [72]

    Videollamb: Long video understanding with recurrent memory bridges

    Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng. Videollamb: Long video understanding with recurrent memory bridges. arXiv, 2024. 3

  73. [73]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747,

  74. [74]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025. 5

  75. [75]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025. 1

  76. [76]

    Pink: Unveiling the power of referential comprehension for multi-modal llms

    Shiyu Xuan, Qingpei Guo, Ming Yang, and Shiliang Zhang. Pink: Unveiling the power of referential comprehension for multi-modal llms. In CVPR, pages 13838–13848, 2024. 3

  77. [77]

    List items one by one: A new data source and learning paradigm for multimodal llms

    An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, et al. List items one by one: A new data source and learning paradigm for multimodal llms. arXiv preprint arXiv:2404.16375, 2024. 3

  78. [78]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfen...

  79. [79]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 1

  80. [80]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023. 3

Showing first 80 references.