See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Bowen Yin; Boyuan Sun; Qibin Hou; Xihan Wei; Yuanming Li

arxiv: 2605.18018 · v1 · pith:SX3Y23HVnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI · cs.HC

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

Pith reviewed 2026-05-20 12:06 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC

keywords fine-grained object understanding · vision-language alignment · cross-attention maps · referring expressions · multimodal large language models · video understanding · spatial consistency

The pith

A training strategy corrects diffuse cross-attention on object nouns so text prompts alone specify precise video objects at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that pretrained multimodal models show a consistent mismatch: attribute words trigger sharp visual activations, whereas object nouns produce scattered ones, which the authors attribute to semantic reference bias. It introduces the NL-Refer dataset, which pairs masks with natural-language descriptions, and applies SWIM to extract multi-layer cross-attention maps from nouns and then enforce their match to ground-truth masks. This supervision occurs only in training, so the model learns to focus on the described object from text alone, without visual prompts such as masks or points at inference. The reported result is stronger alignment and higher accuracy on fine-grained object understanding benchmarks than methods that still require explicit visual guidance at test time.

Core claim

SWIM extracts cross-attention maps from object nouns across layers and enforces spatial consistency with ground-truth masks on the NL-Refer dataset during training; this corrects the diffuse patterns caused by semantic reference bias and lets the model automatically attend to the user-specified object from textual prompts alone at inference.

What carries the argument

Multi-layer cross-attention maps from object nouns whose spatial consistency is enforced against ground-truth masks using the NL-Refer dataset.
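
To make the mechanism concrete, here is a minimal sketch of a mask-consistency loss in the binary cross-entropy form quoted elsewhere on this page (L_BCE = −1/HW ∑ [M log Ā + (1−M) log(1−Ā)]). It is an illustration, not the authors' implementation. Assumptions: the attention maps are already normalized to [0, 1] and resized to the mask resolution, Ā is taken to be the element-wise mean over layers, and the names `mean_fuse`, `bce_consistency_loss`, and the `eps` clamp are hypothetical.

```python
import math


def mean_fuse(layer_maps):
    """Average per-layer attention maps (each H x W, values in [0, 1]) element-wise.

    Assumption: mean fusion stands in for however the paper builds the fused map.
    """
    n = len(layer_maps)
    h, w = len(layer_maps[0]), len(layer_maps[0][0])
    return [[sum(m[r][c] for m in layer_maps) / n for c in range(w)] for r in range(h)]


def bce_consistency_loss(attn, mask, eps=1e-7):
    """Pixel-averaged binary cross-entropy between an attention map and a binary mask.

    Computes -(1 / (H * W)) * sum(M * log(A) + (1 - M) * log(1 - A)).
    The eps clamp is an added numerical safeguard, not part of the quoted formula.
    """
    h, w = len(mask), len(mask[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            a = min(max(attn[r][c], eps), 1.0 - eps)
            m = mask[r][c]
            total += m * math.log(a) + (1 - m) * math.log(1.0 - a)
    return -total / (h * w)


if __name__ == "__main__":
    mask = [[1, 0], [0, 0]]                 # toy ground-truth mask: object in top-left cell
    diffuse = [[0.5, 0.5], [0.5, 0.5]]      # scattered attention
    focused = [[0.9, 0.1], [0.1, 0.1]]      # attention concentrated on the object
    print("diffuse loss:", bce_consistency_loss(diffuse, mask))
    print("focused loss:", bce_consistency_loss(focused, mask))
```

On this toy input the focused map yields a lower loss than the diffuse one, which is the direction the supervision is meant to push noun attention during training.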

If this is right

Models perform fine-grained object understanding in video using only textual prompts at inference.
Performance exceeds that of visual-prompt-based methods on the same benchmarks.
Text-visual alignment improves without changing the underlying model architecture.
Annotation effort for masks or points is needed only during training, not deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency enforcement could be tested on static images to check whether the noun-attention issue is video-specific.
The attention-pattern analysis might guide pretraining objectives that reduce reference bias more broadly.
Referring expressions in the dataset could be generated automatically in follow-up work to scale the method.

Load-bearing premise

The assumption that fixing the observed diffuse attention on nouns through mask supervision in training will automatically produce correct object focus from text prompts without masks at inference time.

What would settle it

After SWIM training, cross-attention maps for object nouns remain diffuse and performance on fine-grained benchmarks shows no gain when visual prompts are removed at test time.
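
One hypothetical way to put a number on "diffuse" for such a check is the normalized entropy of an attention map: values near 1 mean attention is spread almost uniformly, values near 0 mean it is concentrated on a few cells. This metric is an illustrative assumption and is not described in the paper; the name `normalized_entropy` is made up for this sketch.

```python
import math


def normalized_entropy(attn):
    """Shannon entropy of an attention map, scaled to [0, 1].

    The map (H x W, non-negative values) is normalized to sum to 1 and its entropy
    is divided by log(H * W), the entropy of a perfectly uniform map.
    """
    flat = [v for row in attn for v in row]
    total = sum(flat)
    if total <= 0 or len(flat) < 2:
        raise ValueError("attention map needs at least two cells and a positive sum")
    probs = [v / total for v in flat]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(flat))


if __name__ == "__main__":
    print(normalized_entropy([[0.25, 0.25], [0.25, 0.25]]))  # uniform, fully diffuse map
    print(normalized_entropy([[1.0, 0.0], [0.0, 0.0]]))      # all attention on one cell
```

Under this reading, the falsifying outcome would be noun-attention entropy that stays high after SWIM training, alongside flat benchmark scores.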

Figures

Figures reproduced from arXiv: 2605.18018 by Bowen Yin, Boyuan Sun, Qibin Hou, Xihan Wei, Yuanming Li.

**Figure 2.** Visual comparisons of cross-attention maps for object nouns and attribute words between Qwen2.5-VL …

**Figure 3.** Training pipeline of SWIM. Explicit supervision is applied to the cross-attention between object nouns and visual tokens, enabling accurate fine-grained object grounding from pure neutral text prompts at inference without any extra visual prompt.

… LLMs [11, 18, 22, 42, 46, 61] to tackle a wide range of tasks [30, 69, 107]. Beyond image-based approaches [40, 44], recent advances in spatiotemporal architectures de…

**Figure 4.** Scalability of SWIM. The performance of SWIM scales consistently with the increase in data scale.

4.3.2. Effect of Attention Layer Fusion. We further study how attention maps extracted from multiple layers should be fused to provide the alignment signal in SWIM. Several fusion strategies are considered, including addition, pooling, mean, and element-wise product. As shown in Tab. 4, simple mean aggregation…
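
The four fusion strategies named in the excerpt can be sketched as element-wise reductions across layers. This is an illustration under stated assumptions, not the paper's code: each layer map is an H x W grid of floats at a shared resolution, "pooling" is interpreted here as a max over layers (the excerpt does not say which pooling is meant), and `fuse_layers` and its strategy names are hypothetical.

```python
def fuse_layers(layer_maps, strategy="mean"):
    """Fuse per-layer attention maps (each H x W) into a single H x W map.

    strategy: "add", "mean", "product", or "max_pool".
    "max_pool" is an assumed reading of the unspecified "pooling" strategy.
    """
    def combine(values):
        if strategy == "add":
            return sum(values)
        if strategy == "mean":
            return sum(values) / len(values)
        if strategy == "product":
            out = 1.0
            for v in values:
                out *= v
            return out
        if strategy == "max_pool":
            return max(values)
        raise ValueError(f"unknown fusion strategy: {strategy}")

    h, w = len(layer_maps[0]), len(layer_maps[0][0])
    return [[combine([m[r][c] for m in layer_maps]) for c in range(w)] for r in range(h)]


if __name__ == "__main__":
    layers = [[[0.2, 0.4]], [[0.6, 0.8]]]  # two toy 1 x 2 layer maps
    for name in ("add", "mean", "product", "max_pool"):
        print(name, fuse_layers(layers, name))
```

Note that addition and product change the value range of the fused map, so a loss that expects values in [0, 1] would need a renormalization step after those two.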

**Figure 5.** Qualitative comparisons between SWIM and Qwen2.5-VL …

**Figure 6.** Quantitative comparison of fine-grained text–visual …

Original abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SWIM trains MLLMs with mask consistency on a new NL-Refer dataset so text alone can drive fine-grained object attention at inference, but the gains may trace more to data than to the claimed alignment fix.

The main point is that this paper gives a concrete training procedure to drop visual prompts at test time for video object referral. They observe that pretrained models activate sharply on attributes but diffusely on nouns, build NL-Refer with natural-language expressions tied to masks, and add a multi-layer consistency loss that pulls noun attention toward the mask during training. At inference the model runs on text only and reportedly beats prompt-based baselines on fine-grained benchmarks. Code and data are released, which helps anyone who wants to check the implementation or extend the dataset.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SWIM, a training strategy to align vision and language representations in multimodal large language models for fine-grained object understanding. It identifies a systematic discrepancy in cross-attention patterns (sharp activations for attributes, diffuse for object nouns), constructs the NL-Refer dataset pairing object masks with precise natural language referring expressions, and applies a multi-layer consistency loss that enforces spatial agreement between noun-derived cross-attention maps and ground-truth masks during training. The goal is to enable the model to attend correctly to user-specified objects from text prompts alone at inference, without visual prompts such as masks or points, while claiming superior performance over visual-prompt-based methods on fine-grained benchmarks.

Significance. If the central mechanism is validated, SWIM could meaningfully improve the usability of MLLMs for fine-grained video object understanding by removing the requirement for explicit visual inputs at test time. The cross-attention discrepancy analysis and the NL-Refer dataset constitute useful contributions that may aid future alignment research. The significance is tempered by the need for stronger evidence that performance gains arise specifically from the learned spatial consistency rather than dataset enrichment or general fine-tuning.

major comments (3)

[Method (§3) and Experiments (§5)] The core assumption that multi-layer spatial consistency supervision on NL-Refer will cause noun-based cross-attention to become localized and correct at mask-free inference is load-bearing yet under-supported. No ablation isolating the consistency loss from the enriched referring expressions is described, nor is there verification (e.g., attention-map comparisons or quantitative localization metrics) that the behavior persists when the mask signal is removed at test time.
[§5] Experimental results: The claim of substantial improvement in text-visual alignment and superior benchmark performance requires explicit controls. A baseline that fine-tunes on NL-Refer without the consistency term, together with before/after attention visualizations on held-out examples, is needed to attribute gains to the alignment mechanism rather than dataset curation.
[§5] Table or figure in §5: If attention-map results are presented, they should report quantitative measures (e.g., IoU between noun attention and ground-truth masks) on a held-out test set both with and without the mask signal at inference; qualitative examples alone are insufficient to confirm the transfer.
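
The localization metric requested in the last comment could be computed along these lines. This is a sketch under assumptions, not a protocol from the paper: the attention map is binarized at a fixed threshold before comparison, the map and mask share one resolution, and `attention_mask_iou` and the default threshold of 0.5 are hypothetical choices.

```python
def attention_mask_iou(attn, mask, threshold=0.5):
    """Intersection-over-union between a thresholded attention map and a binary mask.

    attn: H x W grid of attention values; mask: H x W grid of 0/1 labels.
    Returns 1.0 when both the thresholded map and the mask are empty (a convention).
    """
    intersection = 0
    union = 0
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            predicted = a >= threshold
            target = bool(m)
            intersection += predicted and target
            union += predicted or target
    return intersection / union if union else 1.0


if __name__ == "__main__":
    attn = [[0.9, 0.6], [0.1, 0.2]]   # toy noun-attention map
    mask = [[1, 0], [0, 0]]           # toy ground-truth mask
    print(attention_mask_iou(attn, mask))
```

Because the score depends on the threshold, a report would need to state it, or sweep several thresholds, for the with-mask and mask-free comparisons to be meaningful.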

minor comments (2)

[Title and §1] The title specifies 'Video' fine-grained object understanding, yet the abstract and method description do not clarify whether the approach is applied to video sequences (with temporal modeling) or to individual frames; this should be stated explicitly in §1 and §3.
[Abstract] The abstract states that code and data are available at the GitHub link; confirm that the released repository includes the exact NL-Refer construction scripts and the multi-layer consistency loss implementation to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We agree that additional ablations and quantitative evaluations are necessary to strengthen the claims regarding the effectiveness of the spatial consistency loss. Below, we provide point-by-point responses to the major comments and describe the revisions we intend to make.

Referee: [Method (§3) and Experiments (§5)] The core assumption that multi-layer spatial consistency supervision on NL-Refer will cause noun-based cross-attention to become localized and correct at mask-free inference is load-bearing yet under-supported. No ablation isolating the consistency loss from the enriched referring expressions is described, nor is there verification (e.g., attention-map comparisons or quantitative localization metrics) that the behavior persists when the mask signal is removed at test time.

Authors: We acknowledge that the current manuscript lacks an explicit ablation to isolate the contribution of the consistency loss from the dataset itself. In the revised version, we will add an ablation study that fine-tunes the model on NL-Refer both with and without the multi-layer consistency term. Furthermore, we will include attention-map comparisons and quantitative metrics (such as mean IoU) on held-out examples to verify that the localized attention behavior transfers to mask-free inference. This will help attribute the performance gains specifically to the alignment mechanism.

revision: yes

Referee: [§5] Experimental results: The claim of substantial improvement in text-visual alignment and superior benchmark performance requires explicit controls. A baseline that fine-tunes on NL-Refer without the consistency term, together with before/after attention visualizations on held-out examples, is needed to attribute gains to the alignment mechanism rather than dataset curation.

Authors: We agree with the need for explicit controls to isolate the effect of the consistency loss. We will incorporate a baseline experiment fine-tuning on NL-Refer without the consistency term and compare it to the full SWIM approach. Additionally, we will add before-and-after attention visualizations on held-out test examples to illustrate the changes in cross-attention patterns induced by the consistency supervision.

revision: yes

Referee: [§5] Table or figure in §5: If attention-map results are presented, they should report quantitative measures (e.g., IoU between noun attention and ground-truth masks) on a held-out test set both with and without the mask signal at inference; qualitative examples alone are insufficient to confirm the transfer.

Authors: We recognize that qualitative examples alone may not suffice to confirm the transfer of localized attention. In the revised manuscript, we will augment the attention-map results with quantitative measures, specifically reporting IoU scores between the noun-derived cross-attention maps and ground-truth masks on a held-out test set. These metrics will be provided for both scenarios: with the mask signal during training (as in the current setup) and at inference without any mask input, to demonstrate the persistence of the alignment.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical training procedure

The paper presents SWIM as an empirical training strategy that applies mask-based spatial consistency supervision only during training on the newly constructed NL-Refer dataset to align cross-attention maps extracted from object nouns. The central claims rest on experimental benchmark results rather than any closed-form derivation, equation, or fitted parameter that reduces the reported improvement to its own inputs by construction. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises, and the method remains externally verifiable through standard train-with-supervision / test-without-supervision protocols. This is a standard supervised fine-tuning setup whose performance claims are independent of the inputs they are measured against.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that cross-attention misalignment for object nouns is correctable via mask-guided training and that this correction generalizes to inference without visual prompts. No free parameters or invented entities are evident from the abstract.

axioms (1)

domain assumption: Cross-attention maps extracted from object nouns in pretrained MLLMs can be made spatially consistent with ground-truth object masks through supervised training.
Invoked when the paper describes extracting multi-layer cross-attention maps and enforcing spatial consistency with masks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks... $L^{(i)}_{\mathrm{BCE}} = -\frac{1}{HW} \sum \left[ M \log \bar{A} + (1 - M) \log (1 - \bar{A}) \right]$
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

Attribute words produce sharp, localized activations... object nouns yield diffuse and scattered patterns

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · 30 internal anchors

[1] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report...
[3] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[4] Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Making large multimodal models understand arbitrary visual prompts. In CVPR, pages 12914–12923, 2024.
[5] Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. ViP-LLaVA: Making large multimodal models understand arbitrary visual prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12914–12923, 2024.
[6] Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, and Yang Liu. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437, 2023.
[7] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[8] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024.
[9] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024.
[10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[11] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.
[12] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.
[13] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[14] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. MeViS: A large-scale benchmark for video segmentation with motion expressions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2694–2703, 2023.
[15] Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. MeViS: A multi-modal dataset for referring motion expression video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[16] Liwei Ding, Kowei Shih, Hairu Wen, Xinshi Li, and Qin Yang. Cross-attention transformer-based visual-language fusion for multimodal image analysis. International Journal of Applied Science, 8(1):p27–p27, 2025.
[17] Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, et al. Docopilot: Improving multimodal models for document-level understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4026–4037, 2025.
[18] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, pages arXiv–2407,
[19] Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision LLM for understanding, generating, segmenting, editing. In NeurIPS, 2024.
[20] Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.
[21] Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, et al. MME-Survey: A comprehensive survey on evaluation of multimodal LLMs. arXiv preprint arXiv:2411.15296, 2024.
[22] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[23] Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. RegionGPT: Towards region understanding vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13796–13806, 2024.
[24] Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, and Ryo Hachiuma. Omni-RGPT: Unifying image and video region-level understanding via token marks. arXiv preprint arXiv:2501.08326, 2025.
[25] Yusong Hu, Runmin Ma, Yue Fan, Jinxin Shi, Zongsheng Cao, Yuhao Zhou, Jiakang Yuan, Xiangchao Yan, Wenlong Zhang, Lei Bai, et al. FlowSearch: Advancing deep research with dynamic structured knowledge flow. arXiv preprint arXiv:2510.08521, 2025.
[26] Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, and Zicheng Liu. Segment and caption anything. In CVPR, pages 13405–13417, 2024.
[27] Nikolai Ilinykh and Simon Dobnik. Attention as grounding: Exploring textual and cross-modal attention on entities and relations in language-and-vision transformer. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4062–4073, 2022.
[28] Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Liu Qin, and Lei Zhang. Referring to any person. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21667–21678, 2025.
[29] Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25004–25014, 2025.
[30] Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, Ming-Ming Cheng, and Qibin Hou. GeoAgent: Learning to geolocate everywhere with reinforced geographic characteristics. arXiv preprint arXiv:2602.12617, 2026.
[31] Omri Kaduri, Shai Bagon, and Tali Dekel. What's in the image? A deep-dive into the vision of vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14549–14558, 2025.
[32] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. Your large vision-language model only needs a few attention heads for visual grounding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9339–9350, 2025.
[33] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[34] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
[35] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.
[36] Peize Li, Qingyi Si, Peng Fu, Zheng Lin, and Yan Wang. Object attribute matters in visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18545–18553, 2024.
[37]

Tgif: A new dataset and benchmark on animated gif description

Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016. 1

work page 2016
[38]

Tempsamp-r1: Effective temporal sampling with reinforcement fine-tuning for video llms.arXiv preprint arXiv:2509.18056, 2025

Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, and Ming-Ming Cheng. Tempsamp-r1: Effective temporal sampling with rein- forcement fine-tuning for video llms.arXiv preprint arXiv:2509.18056, 2025. 1

work page arXiv 2025
[39]

Describe anything: Detailed localized image and video captioning.ArXiv, abs/2504.16072, 2025

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Dar- rell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025. 5, 6

work page arXiv 2025
[40]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, pages 26689–26699, 2024.

[41]

Perceive anything: Recognize, explain, caption, and segment anything in images and videos

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos. arXiv preprint arXiv:2506.05302, 2025.

[42]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[43]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023.

[44]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024.

[45]

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024.

[46]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.

[47]

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. Computational Visual Media,

[48]

ChatGPT

OpenAI. ChatGPT, 2023.

[49]

Gpt-4o system card

OpenAI. Gpt-4o system card, 2024.

[50]

Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning. arXiv preprint arXiv:2412.03565, 2024.

[51]

The 2017 DAVIS Challenge on Video Object Segmentation

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

[52]

Beyond semantics: Rediscovering spatial awareness in vision-language models

Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Beyond semantics: Rediscovering spatial awareness in vision-language models. arXiv preprint arXiv:2503.17349, 2025.

[53]

Artemis: Towards referential understanding in complex videos

Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in complex videos. Advances in Neural Information Processing Systems, 37:114321–114347, 2024.

[54]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[55]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In CVPR, pages 13009–13018, 2024.

[56]

Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment

Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, and Gal Chechik. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Advances in Neural Information Processing Systems, 36:3536–3559, 2023.

[57]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv:2410.174...

[58]

Depth anything at any condition

Boyuan Sun, Modi Jin, Bowen Yin, and Qibin Hou. Depth anything at any condition. arXiv preprint arXiv:2507.01634, 2025.

[59]

Llava-scissor: Token compression with semantic connected components for video llms

Boyuan Sun, Jiaxing Zhao, Xihan Wei, and Qibin Hou. Llava-scissor: Token compression with semantic connected components for video llms. arXiv preprint arXiv:2506.21862, 2025.

[60]

Video understanding with large language models: A survey

Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems ...

[61]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.

[62]

Qwen2-vl

Qwen team. Qwen2-vl. 2024.

[63]

Qwen2.5: A party of foundation models

Qwen Team. Qwen2.5: A party of foundation models,

[64]

Chatterbox: Multi-round multimodal referring and grounding

Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. Chatterbox: Multi-round multimodal referring and grounding. arXiv preprint arXiv:2401.13307, 2024.

[65]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2024.

[66]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

[67]

Elysium: Exploring object-level perception in videos via mllm

Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In European Conference on Computer Vision, pages 166–185. Springer, 2024.

[68]

Reconstructive visual instruction tuning

Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive visual instruction tuning. arXiv preprint arXiv:2410.09575, 2024.

[69]

X-sam: From segment anything to any segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. arXiv preprint arXiv:2508.04655, 2025.

[70]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[71]

Aligning large language models with human: A survey

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023.

[72]

Videollamb: Long video understanding with recurrent memory bridges

Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng. Videollamb: Long video understanding with recurrent memory bridges. arxiv, 2024.

[73]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747,

[74]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025.

[75]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[76]

Pink: Unveiling the power of referential comprehension for multi-modal llms

Shiyu Xuan, Qingpei Guo, Ming Yang, and Shiliang Zhang. Pink: Unveiling the power of referential comprehension for multi-modal llms. In CVPR, pages 13838–13848, 2024.

[77]

List items one by one: A new data source and learning paradigm for multimodal llms

An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, et al. List items one by one: A new data source and learning paradigm for multimodal llms. arXiv preprint arXiv:2404.16375, 2024.

[78]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfen...

[79]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[80]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.



[35] [35]

Mvbench: A comprehensive multi-modal video under- standing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video under- standing benchmark. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 5, 6, 14

work page 2024

[36] [36]

Object attribute matters in visual question answering

Peize Li, Qingyi Si, Peng Fu, Zheng Lin, and Yan Wang. Object attribute matters in visual question answering. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 18545–18553, 2024. 2

work page 2024

[37] [37]

Tgif: A new dataset and benchmark on animated gif description

Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016. 1

work page 2016

[38] [38]

Tempsamp-r1: Effective temporal sampling with reinforcement fine-tuning for video llms.arXiv preprint arXiv:2509.18056, 2025

Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, and Ming-Ming Cheng. Tempsamp-r1: Effective temporal sampling with rein- forcement fine-tuning for video llms.arXiv preprint arXiv:2509.18056, 2025. 1

work page arXiv 2025

[39] [39]

Describe anything: Detailed localized image and video captioning.ArXiv, abs/2504.16072, 2025

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Dar- rell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025. 5, 6

work page arXiv 2025

[40] [40]

Vila: On pre-training for vi- sual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. InCVPR, pages 26689–26699, 2024. 3

work page 2024

[41] [41]

Perceive anything: Recog- nize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recog- nize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025. 1, 5

work page arXiv 2025

[42] [42]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023. 2

work page 2023

[44] [44]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024. 3

work page 2024

[45] [45]

Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial- temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024. 1

work page arXiv 2024

[46] [46]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jian- feng Gao. Large language models: A survey.arXiv preprint arXiv:2402.06196, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Video-bench: A comprehensive benchmark and toolkit for evaluating video- based large language models.Computational Visual Media,

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video- based large language models.Computational Visual Media,

work page

[48] [48]

ChatGPT, 2023

OpenAI. ChatGPT, 2023. 1

work page 2023

[49] [49]

Gpt-4o system card, 2024

OpenAI. Gpt-4o system card, 2024. 2, 5, 6

work page 2024

[50] [50]

Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning. arXiv preprint arXiv:2412.03565, 2024. 1, 2, 5

work page arXiv 2024

[51] [51]

The 2017 DAVIS Challenge on Video Object Segmentation

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar- bel´aez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017

[52] [52]

Beyond semantics: Rediscovering spatial awareness in vision-language models,

Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Be- yond semantics: Rediscovering spatial awareness in vision- language models.arXiv preprint arXiv:2503.17349, 2025. 2

work page arXiv 2025

[53] [53]

Artemis: Towards referential understanding in com- plex videos.Advances in Neural Information Processing Systems, 37:114321–114347, 2024

Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in com- plex videos.Advances in Neural Information Processing Systems, 37:114321–114347, 2024. 5

work page 2024

[54] [54]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 2

work page 2021

[55] [55]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdel- rahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In CVPR, pages 13009–13018, 2024. 3

work page 2024

[56] Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, and Gal Chechik. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Advances in Neural Information Processing Systems, 36:3536–3559, 2023.

[57] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. LongVU: Spatiotemporal adaptive compression for long video-language understanding. arXiv:2410.174...

[58] Boyuan Sun, Modi Jin, Bowen Yin, and Qibin Hou. Depth anything at any condition. arXiv preprint arXiv:2507.01634, 2025.

[59] Boyuan Sun, Jiaxing Zhao, Xihan Wei, and Qibin Hou. Llava-scissor: Token compression with semantic connected components for video llms. arXiv preprint arXiv:2506.21862, 2025.

[60] Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2025.

[61] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.

[62] Qwen Team. Qwen2-vl. 2024.

[63] Qwen Team. Qwen2.5: A party of foundation models,

[64] Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. Chatterbox: Multi-round multimodal referring and grounding. arXiv preprint arXiv:2401.13307, 2024.

[65] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2024.

[66] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

[67] Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In European Conference on Computer Vision, pages 166–185. Springer, 2024.

[68] Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive visual instruction tuning. arXiv preprint arXiv:2410.09575, 2024.

[69] Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. arXiv preprint arXiv:2508.04655, 2025.

[70] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

[71] Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023.

[72] Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng. Videollamb: Long video understanding with recurrent memory bridges. arXiv, 2024.

[73] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747,

[74] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025.

[75] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[76] Shiyu Xuan, Qingpei Guo, Ming Yang, and Shiliang Zhang. Pink: Unveiling the power of referential comprehension for multi-modal llms. In CVPR, pages 13838–13848, 2024.

[77] An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, et al. List items one by one: A new data source and learning paradigm for multimodal llms. arXiv preprint arXiv:2404.16375, 2024.

[78] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfen...

[79] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[80] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.