See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Pith reviewed 2026-05-20 12:06 UTC · model grok-4.3
The pith
A training strategy corrects diffuse cross-attention on object nouns so text prompts alone specify precise video objects at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWIM extracts cross-attention maps from object nouns across layers and enforces spatial consistency with ground-truth masks on the NL-Refer dataset during training; this corrects the diffuse patterns caused by semantic reference bias and lets the model automatically attend to the user-specified object from textual prompts alone at inference.
What carries the argument
Multi-layer cross-attention maps from object nouns whose spatial consistency is enforced against ground-truth masks using the NL-Refer dataset.
If this is right
- Models perform fine-grained object understanding in video using only textual prompts at inference.
- Performance exceeds that of visual-prompt-based methods on the same benchmarks.
- Text-visual alignment improves without changing the underlying model architecture.
- Annotation effort for masks or points is needed only during training, not deployment.
Where Pith is reading between the lines
- The same consistency enforcement could be tested on static images to check whether the noun-attention issue is video-specific.
- The attention-pattern analysis might guide pretraining objectives that reduce reference bias more broadly.
- Referring expressions in the dataset could be generated automatically in follow-up work to scale the method.
Load-bearing premise
The assumption that fixing the observed diffuse attention on nouns through mask supervision in training will automatically produce correct object focus from text prompts without masks at inference time.
What would settle it
After SWIM training, cross-attention maps for object nouns remain diffuse and performance on fine-grained benchmarks shows no gain when visual prompts are removed at test time.
Figures
read the original abstract
We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large languagemodels (MLLMs) reveals a systematic discrepancy: Attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset, in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at \href{https://github.com/HumanMLLM/SWIM}{https://github.com/HumanMLLM/SWIM}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWIM, a training strategy to align vision and language representations in multimodal large language models for fine-grained object understanding. It identifies a systematic discrepancy in cross-attention patterns (sharp activations for attributes, diffuse for object nouns), constructs the NL-Refer dataset pairing object masks with precise natural language referring expressions, and applies a multi-layer consistency loss that enforces spatial agreement between noun-derived cross-attention maps and ground-truth masks during training. The goal is to enable the model to attend correctly to user-specified objects from text prompts alone at inference, without visual prompts such as masks or points, while claiming superior performance over visual-prompt-based methods on fine-grained benchmarks.
Significance. If the central mechanism is validated, SWIM could meaningfully improve the usability of MLLMs for fine-grained video object understanding by removing the requirement for explicit visual inputs at test time. The cross-attention discrepancy analysis and the NL-Refer dataset constitute useful contributions that may aid future alignment research. The significance is tempered by the need for stronger evidence that performance gains arise specifically from the learned spatial consistency rather than dataset enrichment or general fine-tuning.
major comments (3)
- [Method (§3) and Experiments (§5)] The core assumption that multi-layer spatial consistency supervision on NL-Refer will cause noun-based cross-attention to become localized and correct at mask-free inference is load-bearing yet under-supported. No ablation isolating the consistency loss from the enriched referring expressions is described, nor is there verification (e.g., attention-map comparisons or quantitative localization metrics) that the behavior persists when the mask signal is removed at test time.
- [§5] §5 (Experimental results): The claim of substantial improvement in text-visual alignment and superior benchmark performance requires explicit controls. A baseline that fine-tunes on NL-Refer without the consistency term, together with before/after attention visualizations on held-out examples, is needed to attribute gains to the alignment mechanism rather than dataset curation.
- [§5] Table or figure in §5: If attention-map results are presented, they should report quantitative measures (e.g., IoU between noun attention and ground-truth masks) on a held-out test set both with and without the mask signal at inference; qualitative examples alone are insufficient to confirm the transfer.
minor comments (2)
- [Title and §1] The title specifies 'Video' fine-grained object understanding, yet the abstract and method description do not clarify whether the approach is applied to video sequences (with temporal modeling) or to individual frames; this should be stated explicitly in §1 and §3.
- [Abstract] The abstract states that code and data are available at the GitHub link; confirm that the released repository includes the exact NL-Refer construction scripts and the multi-layer consistency loss implementation to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We agree that additional ablations and quantitative evaluations are necessary to strengthen the claims regarding the effectiveness of the spatial consistency loss. Below, we provide point-by-point responses to the major comments and describe the revisions we intend to make.
read point-by-point responses
-
Referee: [Method (§3) and Experiments (§5)] The core assumption that multi-layer spatial consistency supervision on NL-Refer will cause noun-based cross-attention to become localized and correct at mask-free inference is load-bearing yet under-supported. No ablation isolating the consistency loss from the enriched referring expressions is described, nor is there verification (e.g., attention-map comparisons or quantitative localization metrics) that the behavior persists when the mask signal is removed at test time.
Authors: We acknowledge that the current manuscript lacks an explicit ablation to isolate the contribution of the consistency loss from the dataset itself. In the revised version, we will add an ablation study that fine-tunes the model on NL-Refer both with and without the multi-layer consistency term. Furthermore, we will include attention-map comparisons and quantitative metrics (such as mean IoU) on held-out examples to verify that the localized attention behavior transfers to mask-free inference. This will help attribute the performance gains specifically to the alignment mechanism. revision: yes
-
Referee: [§5] §5 (Experimental results): The claim of substantial improvement in text-visual alignment and superior benchmark performance requires explicit controls. A baseline that fine-tunes on NL-Refer without the consistency term, together with before/after attention visualizations on held-out examples, is needed to attribute gains to the alignment mechanism rather than dataset curation.
Authors: We agree with the need for explicit controls to isolate the effect of the consistency loss. We will incorporate a baseline experiment fine-tuning on NL-Refer without the consistency term and compare it to the full SWIM approach. Additionally, we will add before-and-after attention visualizations on held-out test examples to illustrate the changes in cross-attention patterns induced by the consistency supervision. revision: yes
-
Referee: [§5] Table or figure in §5: If attention-map results are presented, they should report quantitative measures (e.g., IoU between noun attention and ground-truth masks) on a held-out test set both with and without the mask signal at inference; qualitative examples alone are insufficient to confirm the transfer.
Authors: We recognize that qualitative examples alone may not suffice to confirm the transfer of localized attention. In the revised manuscript, we will augment the attention-map results with quantitative measures, specifically reporting IoU scores between the noun-derived cross-attention maps and ground-truth masks on a held-out test set. These metrics will be provided for both scenarios: with the mask signal during training (as in the current setup) and at inference without any mask input, to demonstrate the persistence of the alignment. revision: yes
Circularity Check
No significant circularity in empirical training procedure
full rationale
The paper presents SWIM as an empirical training strategy that applies mask-based spatial consistency supervision only during training on the newly constructed NL-Refer dataset to align cross-attention maps extracted from object nouns. The central claims rest on experimental benchmark results rather than any closed-form derivation, equation, or fitted parameter that reduces the reported improvement to its own inputs by construction. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises, and the method remains externally verifiable through standard train-with-supervision / test-without-supervision protocols. This is a standard supervised fine-tuning setup whose performance claims are independent of the inputs they are measured against.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Cross-attention maps extracted from object nouns in pretrained MLLMs can be made spatially consistent with ground-truth object masks through supervised training.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks... L(i)_BCE = −1/HW ∑ [M log Ā + (1−M) log(1−Ā)]
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Attribute words produce sharp, localized activations... object nouns yield diffuse and scattered patterns
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Activitynet: A large-scale video benchmark for human activity understanding
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InProceed- ings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 14
work page 2015
-
[4]
Mak- ing large multimodal models understand arbitrary visual prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Mak- ing large multimodal models understand arbitrary visual prompts. InCVPR, pages 12914–12923, 2024. 3
work page 2024
-
[5]
Vip- llava: Making large multimodal models understand arbi- trary visual prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Vip- llava: Making large multimodal models understand arbi- trary visual prompts. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 12914–12923, 2024. 3
work page 2024
-
[6]
Position-enhanced visual instruction tuning for multimodal large language models
Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, and Yang Liu. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437, 2023. 3
-
[7]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Sharegpt4video: Improving video understanding and generation with better captions
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions.arXiv preprint arXiv:2406.04325, 2024. 2
-
[9]
Panda-70m: Captioning 70m videos with multiple cross-modality teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 5
work page 2024
-
[10]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites.arXiv preprint arXiv:2404.16821, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advanc- ing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Mevis: A large-scale benchmark for video segmentation with motion expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceed- ings of the IEEE/CVF international conference on com- puter vision, pages 2694–2703, 2023. 5
work page 2023
-
[15]
Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. Mevis: A multi-modal dataset for referring motion expres- sion video segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 5
work page 2025
-
[16]
Liwei Ding, Kowei Shih, Hairu Wen, Xinshi Li, and Qin Yang. Cross-attention transformer-based visual-language fusion for multimodal image analysis.International Jour- nal of Applied Science, 8(1):p27–p27, 2025. 2
work page 2025
-
[17]
Docopilot: Improving multimodal models for document-level understanding
Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shen- glong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, et al. Docopilot: Improving multimodal models for document-level understanding. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 4026–4037, 2025. 3
work page 2025
-
[18]
The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,
-
[19]
Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing
Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. In NeurIPS, 2024. 3
work page 2024
-
[20]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 5, 6, 14
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Mme-survey: A comprehensive survey on evaluation of multimodal llms
Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, et al. Mme-survey: A comprehensive sur- vey on evaluation of multimodal llms.arXiv preprint arXiv:2411.15296, 2024. 2
-
[22]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Regiongpt: Towards region understanding vision lan- guage model
Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. Regiongpt: Towards region understanding vision lan- guage model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13796– 13806, 2024. 1, 3
work page 2024
-
[24]
Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Sub- hashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, and Ryo Hachiuma. Omni-rgpt: Unifying image and video region-level understanding via token marks.arXiv preprint arXiv:2501.08326, 2025. 3
-
[25]
Yusong Hu, Runmin Ma, Yue Fan, Jinxin Shi, Zongsheng Cao, Yuhao Zhou, Jiakang Yuan, Xiangchao Yan, Wenlong Zhang, Lei Bai, et al. Flowsearch: Advancing deep research with dynamic structured knowledge flow.arXiv preprint arXiv:2510.08521, 2025. 3
-
[26]
Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, and Zicheng Liu. Segment and caption anything. InCVPR, pages 13405– 13417, 2024. 3
work page 2024
-
[27]
Nikolai Ilinykh and Simon Dobnik. Attention as grounding: Exploring textual and cross-modal attention on entities and relations in language-and-vision transformer. InFindings of the association for computational linguistics: ACL 2022, pages 4062–4073, 2022. 2
work page 2022
-
[28]
Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Liu Qin, and Lei Zhang. Referring to any person. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 21667– 21678, 2025. 1
work page 2025
-
[29]
Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, and Xu Yang. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 25004–25014, 2025. 3
work page 2025
-
[30]
Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, Ming-Ming Cheng, and Qibin Hou. Geoagent: Learning to geolocate everywhere with reinforced geographic char- acteristics.arXiv preprint arXiv:2602.12617, 2026. 3
-
[31]
What’s in the image? a deep-dive into the vision of vision language mod- els
Omri Kaduri, Shai Bagon, and Tali Dekel. What’s in the image? a deep-dive into the vision of vision language mod- els. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14549–14558, 2025. 3
work page 2025
-
[32]
Your large vision-language model only needs a few attention heads for visual grounding
Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. Your large vision-language model only needs a few attention heads for visual grounding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9339–9350, 2025. 3
work page 2025
-
[33]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wen- hai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Mvbench: A comprehensive multi-modal video under- standing benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video under- standing benchmark. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 5, 6, 14
work page 2024
-
[36]
Object attribute matters in visual question answering
Peize Li, Qingyi Si, Peng Fu, Zheng Lin, and Yan Wang. Object attribute matters in visual question answering. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 18545–18553, 2024. 2
work page 2024
-
[37]
Tgif: A new dataset and benchmark on animated gif description
Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016. 1
work page 2016
-
[38]
Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, and Ming-Ming Cheng. Tempsamp-r1: Effective temporal sampling with rein- forcement fine-tuning for video llms.arXiv preprint arXiv:2509.18056, 2025. 1
-
[39]
Describe anything: Detailed localized image and video captioning.ArXiv, abs/2504.16072, 2025
Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Dar- rell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025. 5, 6
-
[40]
Vila: On pre-training for vi- sual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. InCVPR, pages 26689–26699, 2024. 3
work page 2024
-
[41]
Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recog- nize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025. 1, 5
-
[42]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023. 2
work page 2023
-
[44]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024. 3
work page 2024
-
[45]
Oryx MLLM: On- Demand Spatial-Temporal Understanding at Arbi- trary Resolution
Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial- temporal understanding at arbitrary resolution.arXiv preprint arXiv:2409.12961, 2024. 1
-
[46]
Large Language Models: A Survey
Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jian- feng Gao. Large language models: A survey.arXiv preprint arXiv:2402.06196, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video- based large language models.Computational Visual Media,
- [48]
- [49]
-
[50]
Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning
Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Inst-it: Boosting multimodal instance understanding via explicit visual prompt instruction tuning. arXiv preprint arXiv:2412.03565, 2024. 1, 2, 5
-
[51]
The 2017 DAVIS Challenge on Video Object Segmentation
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar- bel´aez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017. 5
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[52]
Beyond semantics: Rediscovering spatial awareness in vision-language models,
Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Be- yond semantics: Rediscovering spatial awareness in vision- language models.arXiv preprint arXiv:2503.17349, 2025. 2
-
[53]
Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in com- plex videos.Advances in Neural Information Processing Systems, 37:114321–114347, 2024. 5
work page 2024
-
[54]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 2
work page 2021
-
[55]
Glamm: Pixel grounding large multimodal model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdel- rahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In CVPR, pages 13009–13018, 2024. 3
work page 2024
-
[56]
Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfo- gel, Yoav Goldberg, and Gal Chechik. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment.Advances in Neural In- formation Processing Systems, 36:3536–3559, 2023. 2
work page 2023
-
[57]
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spatiotemporal adaptive compression for long video-language understanding.arXiv:2410.174...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
Depth anything at any condition.arXiv preprint arXiv:2507.01634, 2025
Boyuan Sun, Modi Jin, Bowen Yin, and Qibin Hou. Depth anything at any condition.arXiv preprint arXiv:2507.01634, 2025. 3
-
[59]
Boyuan Sun, Jiaxing Zhao, Xihan Wei, and Qibin Hou. Llava-scissor: Token compression with semantic con- nected components for video llms.arXiv preprint arXiv:2506.21862, 2025. 3
-
[60]
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali V osoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video understand- ing with large language models: A survey.IEEE Transac- tions on Circuits and Systems ...
work page 2025
-
[61]
Kimi K2: Open Agentic Intelligence
Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelli- gence.arXiv preprint arXiv:2507.20534, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [62]
- [63]
-
[64]
Chat- terbox: Multi-round multimodal referring and grounding
Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. Chat- terbox: Multi-round multimodal referring and grounding. arXiv preprint arXiv:2401.13307, 2024. 3
-
[65]
Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric ex- ploration of multimodal llms.Advances in Neural Informa- tion Processing Systems, 37:87310–87356, 2024. 3
work page 2024
-
[66]
Eyes wide shut? exploring the visual shortcomings of multimodal llms
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 1
work page 2024
-
[67]
Elysium: Exploring object-level perception in videos via mllm
Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. InEuropean Conference on Computer Vision, pages 166–185. Springer, 2024. 5
work page 2024
-
[68]
Reconstructive visual instruction tuning.arXiv preprint arXiv:2410.09575, 2024
Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, and Zhaoxiang Zhang. Reconstructive visual instruction tuning.arXiv preprint arXiv:2410.09575, 2024. 3
-
[69]
X-sam: From segment anything to any segmentation.arXiv preprint arXiv:2508.04655, 2025
Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation.arXiv preprint arXiv:2508.04655, 2025. 3
-
[70]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open- source multimodal models in versatility, reasoning, and ef- ficiency.arXiv preprint arXiv:2508.18265, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[71]
Aligning large language models with human: A survey.arXiv preprint arXiv:2307.12966, 2023
Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xing- shan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey.arXiv preprint arXiv:2307.12966, 2023. 3
-
[72]
Videollamb: Long video understanding with recurrent memory bridges.arxiv, 2024
Yuxuan Wang, Cihang Xie, Yang Liu, and Zilong Zheng. Videollamb: Long video understanding with recurrent memory bridges.arxiv, 2024. 3
work page 2024
-
[73]
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[74]
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025. 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[75]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[76]
Pink: Unveiling the power of referential comprehension for multi-modal llms
Shiyu Xuan, Qingpei Guo, Ming Yang, and Shiliang Zhang. Pink: Unveiling the power of referential comprehension for multi-modal llms. InCVPR, pages 13838–13848, 2024. 3
work page 2024
-
[77]
An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jian- wei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, et al. List items one by one: A new data source and learning paradigm for multimodal llms.arXiv preprint arXiv:2404.16375, 2024. 3
-
[78]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jian- wei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfen...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[79]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[80]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.