{"work":{"id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","openalex_id":null,"doi":null,"arxiv_id":"2412.05271","raw_key":null,"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","authors":null,"authors_text":"Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui","year":2024,"venue":"cs.CV","abstract":"We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLMs to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. HuggingFace demo see https://huggingface.co/spaces/OpenGVLab/InternVL","external_url":"https://arxiv.org/abs/2412.05271","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:35:28.707491+00:00","pith_arxiv_id":"2412.05271","created_at":"2026-05-09T05:45:22.914363+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","render_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling"},"hub":{"state":{"work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":183,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2024-12-05T18:58:26+00:00","last_pith_cited_at":"2026-05-22T04:19:29+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-06T10:40:40.852742+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":29},{"context_role":"baseline","n":18},{"context_role":"method","n":4},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":29},{"context_polarity":"baseline","n":18},{"context_polarity":"use_method","n":5},{"context_polarity":"unclear","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","claims":[{"claim_text":"We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, mult","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of-experts model, 2025. [9] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024. [10] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et a","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"out-of-domain performance, especially on multi-view SPBench and video-only SITE-Bench, demon- strating its advantage in cross-view and video spatial reasoning. Methods SPBench-SI SPBench-MV SPBench SITE-Bench NQ MCQ Avg. NQ MCQ Avg. Proprietary Models GPT-4o [40] 24.5 60.3 42.4 40.7 59.4 50.1 46.2 - Gemini-2.0-Flash [11] 49.0 60.4 54.7 51.9 50.7 51.3 53.0 - Open-Source Models InternVL-2.5-8B [7] 28.3 56.3 42.3 37.3 47.5 42.4 42.3 - Kimi-VL-A3B [37] 25.7 44.9 35.3 23.3 57.6 40.5 37.9 - LLaV A-One","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"We find that such reasoning significantly enhances performance. Finally, in Section 7.5, we provide qualitative visualizations that showcase BAGEL's world modeling abilities, including world navigation and video generation. 7.1 Image Understanding Type Model # LLM Params MME-P ↑ MME-S↑ MMBench↑ MMMU↑ MM-Vet↑ MathVista↑ MMVP↑ Und. Only InternVL2 [13] 1.8B 1440 1877 73.2 34.3 44.6 46.4 35.3 InternVL2.5 [12] 1.8B - 2138 74.7 43.6 60.8 51.3 - Qwen2-VL[77] 1.5B - 1872 74.9 41.1 49.5 43.0 - Qwen2.5-VL","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Furthermore, advanced post-training strategies-comprising supervised fine-tuning (SFT) and mixed preference optimization (MPO) [124]-together with test-time scaling strategies [ 125] and an optimized training infrastructure [15], significantly enhance InternVL3's efficiency and performance. Comprehensive empirical evaluations demonstrate that InternVL3 surpasses its predecessors ( e.g., In- ternVL2.5 [18]) across a wide range of tasks, including multi-discipline reasoning, document understanding","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"jector and finetuning a decoder-only LLM (see §4), and for spatial tasks we train with several different decoders (ViTDet [72] Mask-RCNN [43] with Ab- solute Win [7] for detection, DPT [109] for depth, and zero-shot feature corre- spondance for tracking [52]). For each experiment, we sweep over the layers of the model as the optimal features are not necessarily the last [18]. In each case, we use an equivalent image size (window size for detection) of32 × 32 to- kens. In each plot, we normalize ","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"community. Furthermore, the growth of multimodal capabilities, e.g., long visual context and high-resolution understanding [5, 81, 140, 147, 180, 187], often comes with ever increasing computational costs, which have become a crucial bottleneck of real-world applications. In this work, we introduce InternVL3.5, an advanced family of InternVL series [13, 14, 15, 37, 79, 80, 187] with stronger capabilities in versatility, reasoning, and efficiency. Compared to InternVL3 [187], InternVL3.5 achieves","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (4 contexts).","role_counts":[{"n":4,"context_role":"background"},{"n":3,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-15T15:28:05.569600+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"e0ad149a-61e6-45b7-b8cc-af13f895c748","orcid":null,"display_name":"Zhe Chen"},{"id":"58cc9d30-7cb9-4659-b9bb-e69cbcad0c55","orcid":null,"display_name":"Weiyun Wang"},{"id":"566cd454-fb7d-4f54-81bc-56bc6144c48c","orcid":null,"display_name":"Yue Cao"},{"id":"6ccc605e-3e12-4b7e-86d9-6a3f75f91932","orcid":null,"display_name":"Yangzhou Liu"},{"id":"a8152194-e272-4c2e-af7f-6f194f98f21a","orcid":null,"display_name":"Zhangwei Gao"},{"id":"a96f6a80-0aa2-40ef-9cae-c04c11c86505","orcid":null,"display_name":"Erfei Cui"}]},"error":null,"updated_at":"2026-05-15T15:28:05.550916+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T06:17:12.615193+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":59},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":34},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":30},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":27},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":26},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":22},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":20},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":19},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":18},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":16},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":16},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":15},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":13},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":13},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":12},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":12},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":11},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","shared_citers":11},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":11},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":10},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":10},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":9}],"time_series":[{"n":12,"year":2025},{"n":79,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:17:04.480566+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T06:17:08.559790+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","claims":[{"claim_text":"We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, mult","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of-experts model, 2025. [9] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024. [10] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et a","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"out-of-domain performance, especially on multi-view SPBench and video-only SITE-Bench, demon- strating its advantage in cross-view and video spatial reasoning. Methods SPBench-SI SPBench-MV SPBench SITE-Bench NQ MCQ Avg. NQ MCQ Avg. Proprietary Models GPT-4o [40] 24.5 60.3 42.4 40.7 59.4 50.1 46.2 - Gemini-2.0-Flash [11] 49.0 60.4 54.7 51.9 50.7 51.3 53.0 - Open-Source Models InternVL-2.5-8B [7] 28.3 56.3 42.3 37.3 47.5 42.4 42.3 - Kimi-VL-A3B [37] 25.7 44.9 35.3 23.3 57.6 40.5 37.9 - LLaV A-One","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"We find that such reasoning significantly enhances performance. Finally, in Section 7.5, we provide qualitative visualizations that showcase BAGEL's world modeling abilities, including world navigation and video generation. 7.1 Image Understanding Type Model # LLM Params MME-P ↑ MME-S↑ MMBench↑ MMMU↑ MM-Vet↑ MathVista↑ MMVP↑ Und. Only InternVL2 [13] 1.8B 1440 1877 73.2 34.3 44.6 46.4 35.3 InternVL2.5 [12] 1.8B - 2138 74.7 43.6 60.8 51.3 - Qwen2-VL[77] 1.5B - 1872 74.9 41.1 49.5 43.0 - Qwen2.5-VL","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Furthermore, advanced post-training strategies-comprising supervised fine-tuning (SFT) and mixed preference optimization (MPO) [124]-together with test-time scaling strategies [ 125] and an optimized training infrastructure [15], significantly enhance InternVL3's efficiency and performance. Comprehensive empirical evaluations demonstrate that InternVL3 surpasses its predecessors ( e.g., In- ternVL2.5 [18]) across a wide range of tasks, including multi-discipline reasoning, document understanding","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"jector and finetuning a decoder-only LLM (see §4), and for spatial tasks we train with several different decoders (ViTDet [72] Mask-RCNN [43] with Ab- solute Win [7] for detection, DPT [109] for depth, and zero-shot feature corre- spondance for tracking [52]). For each experiment, we sweep over the layers of the model as the optimal features are not necessarily the last [18]. In each case, we use an equivalent image size (window size for detection) of32 × 32 to- kens. In each plot, we normalize ","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"community. Furthermore, the growth of multimodal capabilities, e.g., long visual context and high-resolution understanding [5, 81, 140, 147, 180, 187], often comes with ever increasing computational costs, which have become a crucial bottleneck of real-world applications. In this work, we introduce InternVL3.5, an advanced family of InternVL series [13, 14, 15, 37, 79, 80, 187] with stronger capabilities in versatility, reasoning, and efficiency. Compared to InternVL3 [187], InternVL3.5 achieves","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (4 contexts).","role_counts":[{"n":4,"context_role":"background"},{"n":3,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-15T15:28:05.558634+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","claims":[{"claim_text":"We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, mult","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:17:08.566217+00:00"}},"summary":{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","claims":[{"claim_text":"We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, mult","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":59},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":34},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":30},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":27},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":26},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":22},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":20},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":19},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":18},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":16},{"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","shared_citers":16},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":15},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":13},{"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","shared_citers":13},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":12},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":12},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":11},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","shared_citers":11},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":11},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":10},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":10},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":9}],"time_series":[{"n":12,"year":2025},{"n":79,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"a96f6a80-0aa2-40ef-9cae-c04c11c86505","orcid":null,"display_name":"Erfei Cui","source":"manual","import_confidence":0.72},{"id":"58cc9d30-7cb9-4659-b9bb-e69cbcad0c55","orcid":null,"display_name":"Weiyun Wang","source":"manual","import_confidence":0.72},{"id":"6ccc605e-3e12-4b7e-86d9-6a3f75f91932","orcid":null,"display_name":"Yangzhou Liu","source":"manual","import_confidence":0.72},{"id":"566cd454-fb7d-4f54-81bc-56bc6144c48c","orcid":null,"display_name":"Yue Cao","source":"manual","import_confidence":0.72},{"id":"a8152194-e272-4c2e-af7f-6f194f98f21a","orcid":null,"display_name":"Zhangwei Gao","source":"manual","import_confidence":0.72},{"id":"e0ad149a-61e6-45b7-b8cc-af13f895c748","orcid":null,"display_name":"Zhe Chen","source":"manual","import_confidence":0.72}]}}