{"work":{"id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","openalex_id":null,"doi":null,"arxiv_id":"2406.07476","raw_key":null,"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","authors":null,"authors_text":"Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen","year":2024,"venue":"cs.CV","abstract":"In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.","external_url":"https://arxiv.org/abs/2406.07476","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:55:25.022050+00:00","pith_arxiv_id":"2406.07476","created_at":"2026-05-10T06:26:27.223405+00:00","updated_at":"2026-05-25T05:55:25.022050+00:00","title_quality_ok":true,"display_title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","render_title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs"},"hub":{"state":{"work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":90,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2024-06-06T17:09:32+00:00","last_pith_cited_at":"2026-05-21T18:00:22+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T12:22:39.340371+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":20},{"context_role":"baseline","n":10},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":20},{"context_polarity":"baseline","n":10},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T14:11:17.628017+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":21},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":19},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":14},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":14},{"title":"Qwen2.5-Omni Technical Report","work_id":"438f105c-fa9b-44aa-ad52-43acb8045cda","shared_citers":14},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":14},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":12},{"title":"Long Context Transfer from Language to Vision","work_id":"52f1b946-568f-4819-9d8a-a87296f8852d","shared_citers":12},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":11},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":10},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":10},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":9},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":9},{"title":"VideoChat: Chat-Centric Video Understanding","work_id":"07461eec-156c-4054-a28e-b84bc53bf6e1","shared_citers":9},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":8},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":8},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":8},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":7},{"title":"Mlvu: A comprehensive benchmark for multi-task long video understanding.arXiv preprint arXiv:2406.04264, 2(5):6","work_id":"346256da-dd21-4cc3-9a98-519467614854","shared_citers":7},{"title":"Pllava: Parameter-free llava extension from images to videos for video dense captioning","work_id":"8949d4db-20f2-47c1-83a6-fcbe041b62ef","shared_citers":7},{"title":"Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning","work_id":"7be17d59-6cde-455a-99c3-06e28659839f","shared_citers":7}],"time_series":[{"n":2,"year":2024},{"n":5,"year":2025},{"n":41,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T14:11:15.743766+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T14:11:20.002164+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","claims":[{"claim_text":"In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Com","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T14:11:11.527265+00:00"}},"summary":{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","claims":[{"claim_text":"In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Com","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":21},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":19},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":14},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":14},{"title":"Qwen2.5-Omni Technical Report","work_id":"438f105c-fa9b-44aa-ad52-43acb8045cda","shared_citers":14},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":14},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":12},{"title":"Long Context Transfer from Language to Vision","work_id":"52f1b946-568f-4819-9d8a-a87296f8852d","shared_citers":12},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":11},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":10},{"title":"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis","work_id":"77fd5ac9-ae98-4846-9d83-e9c73c8f2a52","shared_citers":10},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":9},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":9},{"title":"VideoChat: Chat-Centric Video Understanding","work_id":"07461eec-156c-4054-a28e-b84bc53bf6e1","shared_citers":9},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":8},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":8},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":8},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":7},{"title":"Mlvu: A comprehensive benchmark for multi-task long video understanding.arXiv preprint arXiv:2406.04264, 2(5):6","work_id":"346256da-dd21-4cc3-9a98-519467614854","shared_citers":7},{"title":"Pllava: Parameter-free llava extension from images to videos for video dense captioning","work_id":"8949d4db-20f2-47c1-83a6-fcbe041b62ef","shared_citers":7},{"title":"Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning","work_id":"7be17d59-6cde-455a-99c3-06e28659839f","shared_citers":7}],"time_series":[{"n":2,"year":2024},{"n":5,"year":2025},{"n":41,"year":2026}],"dependency_candidates":[]},"authors":[]}}