{"paper":{"title":"MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"An 8B multimodal model outperforms GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using far less memory and inference time.","cross_cats":["cs.CV"],"primary_cat":"cs.LG","authors_text":"Bingxiang He, Bokai Xu, Chi Chen, Chongyi Wang, Fuwei Huang, Ganqu Cui, Guoyang Zeng, Hanyu Liu, Hongyuan Liu, Jie Cai, Jie Zhou, Jingkun Tang, Ji Qi, Junbo Cui, Liqing Ruan, Luoyuan Zhang, Maosong Sun, Ning Ding, Qining Guo, Tianchi Cai, Tianyu Yu, Weize Chen, Wenhao Hu, Wenshuo Ma, Xu Han, Yingjing Xu, Yuanqian Zhao, Yuan Yao, Yuxiang Huang, Yuxuan Li, Zefan Wang, Zhihui He, Zhiyuan Liu, Zonghao Guo","submitted_at":"2025-09-16T19:41:48Z","abstract_excerpt":"Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for docum"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MiniCPM-V 4.5 surpasses GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using 46.7% GPU memory and 8.7% inference time of Qwen2.5-VL 7B on VideoMME.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The reported benchmarks and efficiency measurements generalize beyond the specific evaluation suites and hardware configurations used in the experiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"An 8B multimodal model outperforms GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using far less memory and inference time.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2ceb82c42b15357dbfb0a7f6377d0f0009d2bdd9b2a8f365a98d4d58bc6bfdc6"},"source":{"id":"2509.18154","kind":"arxiv","version":1},"verdict":{"id":"47412f23-1495-4992-8839-3dbedf72224e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T17:01:57.476728Z","strongest_claim":"MiniCPM-V 4.5 surpasses GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using 46.7% GPU memory and 8.7% inference time of Qwen2.5-VL 7B on VideoMME.","one_line_summary":"An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The reported benchmarks and efficiency measurements generalize beyond the specific evaluation suites and hardware configurations used in the experiments.","pith_extraction_headline":"An 8B multimodal model outperforms GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using far less memory and inference time."},"references":{"count":77,"sample":[{"doi":"","year":2025,"title":"Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025","work_id":"286e20cc-8045-4f8f-9d57-8a1a3db3a56a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fan","work_id":"603eef21-7c9d-450f-92a4-3e490fcf8b55","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Mimo-vl technical report, 2025","work_id":"0cf78c2c-df81-4b1a-afb1-e5fe9fa2e920","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Openai platform chatgpt-4o, 2025","work_id":"32c7e11a-916a-4e6f-9d1e-b5f9f2663738","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond","work_id":"b5c6eee0-36d1-4d4a-8258-f4ef8294b341","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":77,"snapshot_sha256":"71ca9b7fc257e7f71722b9db510a680febebcec982d40ba96ff6dace6e64cddd","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}