{"work":{"id":"07461eec-156c-4054-a28e-b84bc53bf6e1","openalex_id":null,"doi":null,"arxiv_id":"2305.06355","raw_key":null,"title":"VideoChat: Chat-Centric Video Understanding","authors":null,"authors_text":"KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo","year":2023,"venue":"cs.CV","abstract":"In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything","external_url":"https://arxiv.org/abs/2305.06355","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:05:30.614030+00:00","pith_arxiv_id":"2305.06355","created_at":"2026-05-10T06:56:47.763323+00:00","updated_at":"2026-05-25T08:05:30.614030+00:00","title_quality_ok":true,"display_title":"VideoChat: Chat-Centric Video Understanding","render_title":"VideoChat: Chat-Centric Video Understanding"},"hub":{"state":{"work_id":"07461eec-156c-4054-a28e-b84bc53bf6e1","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":74,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2023-05-05T17:59:46+00:00","last_pith_cited_at":"2026-05-22T04:19:29+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-04T16:07:44.676546+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":16},{"context_role":"baseline","n":7},{"context_role":"dataset","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":16},{"context_polarity":"baseline","n":7},{"context_polarity":"use_dataset","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-18T03:50:43.223232+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":24},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":20},{"title":"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding","work_id":"555cf04a-49a7-44b8-9019-a83ce85ace95","shared_citers":20},{"title":"Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models","work_id":"51f627f4-8fae-4882-a3e9-abdf932ef27b","shared_citers":19},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":17},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":16},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":16},{"title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","work_id":"e2121c51-a55e-476a-af81-7ba6970fe6cf","shared_citers":16},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":15},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":14},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":14},{"title":"MMBench: Is Your Multi-modal Model an All-around Player?","work_id":"3b44943d-0f15-4228-9ac3-0e376f4f9ada","shared_citers":14},{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","shared_citers":14},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":14},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":13},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":13},{"title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","work_id":"7f3bac41-a0a5-4a7a-bfd2-526b616db745","shared_citers":13},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"Long Context Transfer from Language to Vision","work_id":"52f1b946-568f-4819-9d8a-a87296f8852d","shared_citers":11},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":10},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":10},{"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","work_id":"0fe2cfd8-d442-4ceb-b1a9-a465704f39b2","shared_citers":9}],"time_series":[{"n":9,"year":2023},{"n":10,"year":2024},{"n":13,"year":2025},{"n":17,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","context_text":"4 / 55.0 61.0 1.39 56.6 53.0 54.1 Qwen2-VL-2B [138] 55.6 / 60.4 63.2 - - - - Qwen2.5-VL-3B [5] 61.5 / 67.6 67.0 1.63 68.2 43.3 60.3 InternVL3-2B [187] 58.9 / 61.4 70.4 1.42 64.2 55.4 59.6 InternVL3.5-2B 58.4 / 61.9 65.9 1.56 64.4 57.4 60.0 MiniCPM-V-4-4B [164] 61.2 / 65.8 58.7 - - - - InternVL3.5-4B 65.4 / 68.6 71.2 1.59 70.4 60.8 64.9 VideoChat2-HD [62] 45.3 / 55.7 62.3 1.22 47.9 - - LLaV A-OneVision-7B [58] 58.2 / - 56.7 - - - - MiniCPM-V-2.6 [164] 60.9 / 63.6 - 1.70 - 54.9 - Qwen2-VL-7B [138] 63.3 / 69.0 67.0 1.44 - 55.6 - Qwen2.5-VL-7B [5] 65.1 / 71.6 69.6 1.79 70.2 45.3 63.6 Keye-VL-8B [126] 67.7 / - - - - 64.8 - GLM-4.1V-9B [126] 68.2 / 73.6 68.4 1.63 71.5 65.7 67.0 InternVL3-8B [187] 66.3 / 68.","citing_arxiv_id":"2508.18265"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","context_text":"9 - - InternVL3-1B 51.0 / 53.0 63.1 1.3 53.0 48.1 24.8 / 39.1 46.9 Qwen2-VL-2B [121] 55.6 / 60.4 63.2 - - - - - Qwen2.5-VL-3B [7] 61.5 / 67.6 67.0 1.63 68.2 43.3 - - InternVL2-2B [19] 46.2 / 49.1 60.2 1.30 54.3 46.0 - - InternVL2.5-2B [18] 51.9 / 54.1 68.8 1.44 61.4 52.0 - - InternVL3-2B 58.9 / 61.4 70.4 1.42 64.2 55.4 30.8 / 50.7 54.9 VideoChat2-HD [64] 45.3 / 55.7 62.3 1.22 47.9 - - - MiniCPM-V-2.6 [135] 60.9 / 63.6 - 1.70 - 54.9 - - LLaV A-OneVision-7B [60] 58.2 / - 56.7 - - - - - Qwen2-VL-7B [121] 63.3 / 69.0 67.0 1.44 - 55.6 - - Qwen2.5-VL-7B [7] 65.1 / 71.6 69.6 1.79 70.2 45.3 - - InternVL2-8B [19] 56.3 / 59.3 65.8 1.57 64.0 54.6 - - InternVL2.5-8B [18] 64.2 / 66.9 72.0 1.68 68.9 60.0 - - InternVL3-8B 66.","citing_arxiv_id":"2504.10479"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","context_text":"Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216],General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35],Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174] VideoChat2-IT (en & zh) [130, 131], EgoTaskQA (en) [99], NTU RGB+D (en) [152], CLEVRER (en) [276], LLaV A-Video (en) [307], FineVideo (en) [67], PerceptionTest (en) [193], HiREST (en) [291], STAR (en) [259],General QA EgoSchema (en) [175], ScanQA (en) [10], LSMDC (en) [201] GUI GUI-World (en) [24] Type: Text Datasets UltraFeedback (en) [48], UltraChat (en) [58], Unnatural-Instructions (en) [90], NoRobots (en) [196],","citing_arxiv_id":"2412.05271"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models","primary_cat":"cs.CV","context_text":"Text-rich VQA, MIVQA: Multi-image VQA, QB: Q-Bench, SQ: ScanQA, Math: MathVerse-mv, Sci: SciVerse-mv. Model NExTQA MVBench ActivityNet-QA VDD VideoChat-GPT CI DO CU TU CO GPT-4V [43] - - - 4.00 4.09 3.88 4.37 3.94 4.02 VideoChatGPT (7B) [39] - - 35.2/2.70 - 2.40 2.52 2.62 1.98 2.37 Video-LLaV A (7B) [32] - - 45.3/3.30 - 2.87 2.94 3.44 2.45 2.51 VISTA-LLaMA (7B) [38] - - 48.3/3.30 - 2.44 2.31 2.64 3.18 2.26 VideoChat2 (7B) [29] 68.6 51.9 49.1/3.30 - 3.02 2.88 3.51 2.66 2.81 LLaMA-VID (7B) [31] - 50.2 47.4/3.30 2.84 3.01 2.97 3.54 2.53 2.60 LLaV A-NeXT-Video (7B) [67]- - 53.5/3.20 3.32 3.39 3.29 3.92 2.60 3.12 LLaV A-NeXT-Video-DPO (7B)- - 60.2/3.50 3.72 3.64 3.45 4.17 2.95 4.08 LLaV A-NeXT-Video-DPO (34B)- - 64.4/3.60 3.84 3.81 3.55 4.24 3.14 4.12 LLaV A-NeXT-Interleave","citing_arxiv_id":"2407.07895"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","context_text":"Video → Text Text → Videomethod #F R@1 R@5 R@10 R@1 R@5 R@10 avg. OpenAI CLIP-L [117] 1 27.8 49.4 58.0 29.0 50.5 59.2 45.7 InternVL-C (ours) 1 35.3 56.6 66.6 37.5 60.9 70.9 54.6 InternVL-G (ours) 1 36.6 58.3 67.7 39.1 61.7 70.7 55.7 OpenAI CLIP-L [117] 8 26.6 50.8 61.8 30.7 54.4 64.0 48.1 Florence [171] 8 - - - 37.6 63.8 72.6 - InternVideo† [151] 8 39.6 - - 40.7 - - - UMT-L† [83] 8 38.6 59.8 69.6 42.6 64.4 73.1 58.0 LanguageBind† [186] 8 40.9 66.4 75.7 44.8 70.0 78.7 62.8 InternVL-C (ours) 8 40.2 63.1 74.1 44.7 68.2 78.4 61.5 InternVL-G (ours) 8 42.4 65.9 75.4 46.3 70.5 79.6 63.4 Table 14. Comparison of zero-shot video-text retrieval per- formance on MSR-VTT. \"#F\" denotes the number of frames. † These models are trained with temporal attention layers.","citing_arxiv_id":"2312.14238"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension","primary_cat":"cs.CL","context_text":"2 Related Work Multimodal Large Language Models. With the impressive success of Large language models (LLM) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation through utilizing the strong generality of LLMs. Some work [ 15, 16, 17] further considers video inputs and leverage the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a comprehensive quantitative evaluations of these models to thoroughly assess and compare their performance in generative comprehension. Benchmarks for Multimodal Large Language Models. With the rapid development of Multi-","citing_arxiv_id":"2307.16125"}]},"error":null,"updated_at":"2026-05-18T03:50:43.305443+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-18T03:50:39.887760+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"VideoChat: Chat-Centric Video Understanding","claims":[{"claim_text":"In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"2 Related Work Multimodal Large Language Models. With the impressive success of Large language models (LLM) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation through utilizing the strong generality of LLMs. Some work [ 15, 16, 17] further considers video inputs and leverage the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Video → Text Text → Videomethod #F R@1 R@5 R@10 R@1 R@5 R@10 avg. OpenAI CLIP-L [117] 1 27.8 49.4 58.0 29.0 50.5 59.2 45.7 InternVL-C (ours) 1 35.3 56.6 66.6 37.5 60.9 70.9 54.6 InternVL-G (ours) 1 36.6 58.3 67.7 39.1 61.7 70.7 55.7 OpenAI CLIP-L [117] 8 26.6 50.8 61.8 30.7 54.4 64.0 48.1 Florence [171] 8 - - - 37.6 63.8 72.6 - InternVideo† [151] 8 39.6 - - 40.7 - - - UMT-L† [83] 8 38.6 59.8 69.6 42.6 64.4 73.1 58.0 LanguageBind† [186] 8 40.9 66.4 75.7 44.8 70.0 78.7 62.8 InternVL-C (ours) 8 40.","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"MLLMs-based video translation. By structuring our survey in this role-oriented manner (see Fig. 1), we aim to provide conceptual clarity and facilitate comparative analysis. The arXiv:2604.11283v1 [cs.CV] 13 Apr 2026 2 Taxonomy The SemanticReasoner Video-Language Alignment MiniGPT4-Video [1], FrozenBiLM [2], Video-ChatGPT [3], Video-LLaMA [4], VideoChat [5], LLaMA-VID [6], Valley [7], Vista-LLaMA [8], IG-VLM [9], VideoChat2 [10], VaQuitA [11], Vamos [12], COSMO [13], IVA [14], MMICT [15], LXMERT","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"4 / 55.0 61.0 1.39 56.6 53.0 54.1 Qwen2-VL-2B [138] 55.6 / 60.4 63.2 - - - - Qwen2.5-VL-3B [5] 61.5 / 67.6 67.0 1.63 68.2 43.3 60.3 InternVL3-2B [187] 58.9 / 61.4 70.4 1.42 64.2 55.4 59.6 InternVL3.5-2B 58.4 / 61.9 65.9 1.56 64.4 57.4 60.0 MiniCPM-V-4-4B [164] 61.2 / 65.8 58.7 - - - - InternVL3.5-4B 65.4 / 68.6 71.2 1.59 70.4 60.8 64.9 VideoChat2-HD [62] 45.3 / 55.7 62.3 1.22 47.9 - - LLaV A-OneVision-7B [58] 58.2 / - 56.7 - - - - MiniCPM-V-2.6 [164] 60.9 / 63.6 - 1.70 - 54.9 - Qwen2-VL-7B [138]","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216],General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35],Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174] VideoChat2-IT (en & zh) [130, 131], EgoTaskQA (en) [9","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"improved long video understanding accuracy. 2 Related Work Vision-Language Models for Long Sequence Understanding.Early Vision-Language Models (VLMs), such as GPT-4V and Gemini-1.5 [49, 58], showcased powerful multimodal reasoning by integrating visual encoders with large language models. Open-source efforts like Llama-Vid [36], IDEFICS [24], VideoChat [34], Video-LLaMA [12], and others [2, 32, 35, 38, 44, 61, 62] have further advanced capabilities, often matching or exceeding proprietary system","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks VideoChat: Chat-Centric Video Understanding because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (8 contexts).","role_counts":[{"n":8,"context_role":"background"},{"n":5,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-18T03:50:35.024402+00:00"}},"summary":{"title":"VideoChat: Chat-Centric Video Understanding","claims":[{"claim_text":"In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"2 Related Work Multimodal Large Language Models. With the impressive success of Large language models (LLM) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation through utilizing the strong generality of LLMs. Some work [ 15, 16, 17] further considers video inputs and leverage the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Video → Text Text → Videomethod #F R@1 R@5 R@10 R@1 R@5 R@10 avg. OpenAI CLIP-L [117] 1 27.8 49.4 58.0 29.0 50.5 59.2 45.7 InternVL-C (ours) 1 35.3 56.6 66.6 37.5 60.9 70.9 54.6 InternVL-G (ours) 1 36.6 58.3 67.7 39.1 61.7 70.7 55.7 OpenAI CLIP-L [117] 8 26.6 50.8 61.8 30.7 54.4 64.0 48.1 Florence [171] 8 - - - 37.6 63.8 72.6 - InternVideo† [151] 8 39.6 - - 40.7 - - - UMT-L† [83] 8 38.6 59.8 69.6 42.6 64.4 73.1 58.0 LanguageBind† [186] 8 40.9 66.4 75.7 44.8 70.0 78.7 62.8 InternVL-C (ours) 8 40.","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"MLLMs-based video translation. By structuring our survey in this role-oriented manner (see Fig. 1), we aim to provide conceptual clarity and facilitate comparative analysis. The arXiv:2604.11283v1 [cs.CV] 13 Apr 2026 2 Taxonomy The SemanticReasoner Video-Language Alignment MiniGPT4-Video [1], FrozenBiLM [2], Video-ChatGPT [3], Video-LLaMA [4], VideoChat [5], LLaMA-VID [6], Valley [7], Vista-LLaMA [8], IG-VLM [9], VideoChat2 [10], VaQuitA [11], Vamos [12], COSMO [13], IVA [14], MMICT [15], LXMERT","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"4 / 55.0 61.0 1.39 56.6 53.0 54.1 Qwen2-VL-2B [138] 55.6 / 60.4 63.2 - - - - Qwen2.5-VL-3B [5] 61.5 / 67.6 67.0 1.63 68.2 43.3 60.3 InternVL3-2B [187] 58.9 / 61.4 70.4 1.42 64.2 55.4 59.6 InternVL3.5-2B 58.4 / 61.9 65.9 1.56 64.4 57.4 60.0 MiniCPM-V-4-4B [164] 61.2 / 65.8 58.7 - - - - InternVL3.5-4B 65.4 / 68.6 71.2 1.59 70.4 60.8 64.9 VideoChat2-HD [62] 45.3 / 55.7 62.3 1.22 47.9 - - LLaV A-OneVision-7B [58] 58.2 / - 56.7 - - - - MiniCPM-V-2.6 [164] 60.9 / 63.6 - 1.70 - 54.9 - Qwen2-VL-7B [138]","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216],General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35],Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174] VideoChat2-IT (en & zh) [130, 131], EgoTaskQA (en) [9","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"improved long video understanding accuracy. 2 Related Work Vision-Language Models for Long Sequence Understanding.Early Vision-Language Models (VLMs), such as GPT-4V and Gemini-1.5 [49, 58], showcased powerful multimodal reasoning by integrating visual encoders with large language models. Open-source efforts like Llama-Vid [36], IDEFICS [24], VideoChat [34], Video-LLaMA [12], and others [2, 32, 35, 38, 44, 61, 62] have further advanced capabilities, often matching or exceeding proprietary system","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks VideoChat: Chat-Centric Video Understanding because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (8 contexts).","role_counts":[{"n":8,"context_role":"background"},{"n":5,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"graph":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":24},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":20},{"title":"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding","work_id":"555cf04a-49a7-44b8-9019-a83ce85ace95","shared_citers":20},{"title":"Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models","work_id":"51f627f4-8fae-4882-a3e9-abdf932ef27b","shared_citers":19},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":17},{"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","shared_citers":16},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":16},{"title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","work_id":"e2121c51-a55e-476a-af81-7ba6970fe6cf","shared_citers":16},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":15},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":14},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":14},{"title":"MMBench: Is Your Multi-modal Model an All-around Player?","work_id":"3b44943d-0f15-4228-9ac3-0e376f4f9ada","shared_citers":14},{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","shared_citers":14},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":14},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":13},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":13},{"title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","work_id":"7f3bac41-a0a5-4a7a-bfd2-526b616db745","shared_citers":13},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"Long Context Transfer from Language to Vision","work_id":"52f1b946-568f-4819-9d8a-a87296f8852d","shared_citers":11},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":10},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":10},{"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","work_id":"0fe2cfd8-d442-4ceb-b1a9-a465704f39b2","shared_citers":9}],"time_series":[{"n":9,"year":2023},{"n":10,"year":2024},{"n":13,"year":2025},{"n":17,"year":2026}],"dependency_candidates":[{"n":1,"role":"baseline","polarity":"baseline","paper_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","primary_cat":"cs.CV","context_text":"4 / 55.0 61.0 1.39 56.6 53.0 54.1 Qwen2-VL-2B [138] 55.6 / 60.4 63.2 - - - - Qwen2.5-VL-3B [5] 61.5 / 67.6 67.0 1.63 68.2 43.3 60.3 InternVL3-2B [187] 58.9 / 61.4 70.4 1.42 64.2 55.4 59.6 InternVL3.5-2B 58.4 / 61.9 65.9 1.56 64.4 57.4 60.0 MiniCPM-V-4-4B [164] 61.2 / 65.8 58.7 - - - - InternVL3.5-4B 65.4 / 68.6 71.2 1.59 70.4 60.8 64.9 VideoChat2-HD [62] 45.3 / 55.7 62.3 1.22 47.9 - - LLaV A-OneVision-7B [58] 58.2 / - 56.7 - - - - MiniCPM-V-2.6 [164] 60.9 / 63.6 - 1.70 - 54.9 - Qwen2-VL-7B [138] 63.3 / 69.0 67.0 1.44 - 55.6 - Qwen2.5-VL-7B [5] 65.1 / 71.6 69.6 1.79 70.2 45.3 63.6 Keye-VL-8B [126] 67.7 / - - - - 64.8 - GLM-4.1V-9B [126] 68.2 / 73.6 68.4 1.63 71.5 65.7 67.0 InternVL3-8B [187] 66.3 / 68.","citing_arxiv_id":"2508.18265"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","primary_cat":"cs.CV","context_text":"9 - - InternVL3-1B 51.0 / 53.0 63.1 1.3 53.0 48.1 24.8 / 39.1 46.9 Qwen2-VL-2B [121] 55.6 / 60.4 63.2 - - - - - Qwen2.5-VL-3B [7] 61.5 / 67.6 67.0 1.63 68.2 43.3 - - InternVL2-2B [19] 46.2 / 49.1 60.2 1.30 54.3 46.0 - - InternVL2.5-2B [18] 51.9 / 54.1 68.8 1.44 61.4 52.0 - - InternVL3-2B 58.9 / 61.4 70.4 1.42 64.2 55.4 30.8 / 50.7 54.9 VideoChat2-HD [64] 45.3 / 55.7 62.3 1.22 47.9 - - - MiniCPM-V-2.6 [135] 60.9 / 63.6 - 1.70 - 54.9 - - LLaV A-OneVision-7B [60] 58.2 / - 56.7 - - - - - Qwen2-VL-7B [121] 63.3 / 69.0 67.0 1.44 - 55.6 - - Qwen2.5-VL-7B [7] 65.1 / 71.6 69.6 1.79 70.2 45.3 - - InternVL2-8B [19] 56.3 / 59.3 65.8 1.57 64.0 54.6 - - InternVL2.5-8B [18] 64.2 / 66.9 72.0 1.68 68.9 60.0 - - InternVL3-8B 66.","citing_arxiv_id":"2504.10479"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","context_text":"Img-Diff (en) [101], Birds-to-Words (en) [100], Spot-the-Diff (en) [100], MultiVQA (en) [100], NLVR2 (en) [216],General QA ContrastiveCaption (en) [100], DreamSim (en) [100], InternVL-SA-1B-Caption (en & zh) [36] Document MP-DocVQA (en) [233], MP-Docmatix (en) [121] Type: Video Datasets Vript (en & zh) [269], OpenVid (en) [190], Mementos (en) [254], ShareGPT4o-Video (en & zh) [35],Captioning ShareGPT4Video (en & zh) [30], VideoGPT+ (en) [174] VideoChat2-IT (en & zh) [130, 131], EgoTaskQA (en) [99], NTU RGB+D (en) [152], CLEVRER (en) [276], LLaV A-Video (en) [307], FineVideo (en) [67], PerceptionTest (en) [193], HiREST (en) [291], STAR (en) [259],General QA EgoSchema (en) [175], ScanQA (en) [10], LSMDC (en) [201] GUI GUI-World (en) [24] Type: Text Datasets UltraFeedback (en) [48], UltraChat (en) [58], Unnatural-Instructions (en) [90], NoRobots (en) [196],","citing_arxiv_id":"2412.05271"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models","primary_cat":"cs.CV","context_text":"Text-rich VQA, MIVQA: Multi-image VQA, QB: Q-Bench, SQ: ScanQA, Math: MathVerse-mv, Sci: SciVerse-mv. Model NExTQA MVBench ActivityNet-QA VDD VideoChat-GPT CI DO CU TU CO GPT-4V [43] - - - 4.00 4.09 3.88 4.37 3.94 4.02 VideoChatGPT (7B) [39] - - 35.2/2.70 - 2.40 2.52 2.62 1.98 2.37 Video-LLaV A (7B) [32] - - 45.3/3.30 - 2.87 2.94 3.44 2.45 2.51 VISTA-LLaMA (7B) [38] - - 48.3/3.30 - 2.44 2.31 2.64 3.18 2.26 VideoChat2 (7B) [29] 68.6 51.9 49.1/3.30 - 3.02 2.88 3.51 2.66 2.81 LLaMA-VID (7B) [31] - 50.2 47.4/3.30 2.84 3.01 2.97 3.54 2.53 2.60 LLaV A-NeXT-Video (7B) [67]- - 53.5/3.20 3.32 3.39 3.29 3.92 2.60 3.12 LLaV A-NeXT-Video-DPO (7B)- - 60.2/3.50 3.72 3.64 3.45 4.17 2.95 4.08 LLaV A-NeXT-Video-DPO (34B)- - 64.4/3.60 3.84 3.81 3.55 4.24 3.14 4.12 LLaV A-NeXT-Interleave","citing_arxiv_id":"2407.07895"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","context_text":"Video → Text Text → Videomethod #F R@1 R@5 R@10 R@1 R@5 R@10 avg. OpenAI CLIP-L [117] 1 27.8 49.4 58.0 29.0 50.5 59.2 45.7 InternVL-C (ours) 1 35.3 56.6 66.6 37.5 60.9 70.9 54.6 InternVL-G (ours) 1 36.6 58.3 67.7 39.1 61.7 70.7 55.7 OpenAI CLIP-L [117] 8 26.6 50.8 61.8 30.7 54.4 64.0 48.1 Florence [171] 8 - - - 37.6 63.8 72.6 - InternVideo† [151] 8 39.6 - - 40.7 - - - UMT-L† [83] 8 38.6 59.8 69.6 42.6 64.4 73.1 58.0 LanguageBind† [186] 8 40.9 66.4 75.7 44.8 70.0 78.7 62.8 InternVL-C (ours) 8 40.2 63.1 74.1 44.7 68.2 78.4 61.5 InternVL-G (ours) 8 42.4 65.9 75.4 46.3 70.5 79.6 63.4 Table 14. Comparison of zero-shot video-text retrieval per- formance on MSR-VTT. \"#F\" denotes the number of frames. † These models are trained with temporal attention layers.","citing_arxiv_id":"2312.14238"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension","primary_cat":"cs.CL","context_text":"2 Related Work Multimodal Large Language Models. With the impressive success of Large language models (LLM) [1, 5, 4], recent studies work on generative Multimodal Large Language Models (MLLMs) [6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21] to improve multimodal comprehension and generation through utilizing the strong generality of LLMs. Some work [ 15, 16, 17] further considers video inputs and leverage the vast capabilities of LLMs for video understanding tasks. In SEED-Bench, we provide a comprehensive quantitative evaluations of these models to thoroughly assess and compare their performance in generative comprehension. Benchmarks for Multimodal Large Language Models. With the rapid development of Multi-","citing_arxiv_id":"2307.16125"}]},"authors":[]}}