{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:42WE65LJACTHJRNRQPEDHHNNVL","short_pith_number":"pith:42WE65LJ","schema_version":"1.0","canonical_sha256":"e6ac4f756900a674c5b183c8339dadaaf431922b66d39b98d7b5444ed9656249","source":{"kind":"arxiv","id":"2509.18154","version":1},"attestation_state":"computed","paper":{"title":"MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"An 8B multimodal model outperforms GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using far less memory and inference time.","cross_cats":["cs.CV"],"primary_cat":"cs.LG","authors_text":"Bingxiang He, Bokai Xu, Chi Chen, Chongyi Wang, Fuwei Huang, Ganqu Cui, Guoyang Zeng, Hanyu Liu, Hongyuan Liu, Jie Cai, Jie Zhou, Jingkun Tang, Ji Qi, Junbo Cui, Liqing Ruan, Luoyuan Zhang, Maosong Sun, Ning Ding, Qining Guo, Tianchi Cai, Tianyu Yu, Weize Chen, Wenhao Hu, Wenshuo Ma, Xu Han, Yingjing Xu, Yuanqian Zhao, Yuan Yao, Yuxiang Huang, Yuxuan Li, Zefan Wang, Zhihui He, Zhiyuan Liu, Zonghao Guo","submitted_at":"2025-09-16T19:41:48Z","abstract_excerpt":"Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for docum"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2509.18154","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2025-09-16T19:41:48Z","cross_cats_sorted":["cs.CV"],"title_canon_sha256":"551c7e34431a4ea04a6cfcb882aee58a714c3bdacb0994a04809586ae68d93be","abstract_canon_sha256":"e5cfdbed2eade0b1e05dc98ba7f76b24012e110da40b86f4e30df28e86dba62e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.810493Z","signature_b64":"K2EuBltsjgV1sx+3LSaIDT2gixFYvmd74F60C+C9zS5p3ssMg9IMoDYeVz7G/lxESseYTAtA2yotc3OGoHMeDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"e6ac4f756900a674c5b183c8339dadaaf431922b66d39b98d7b5444ed9656249","last_reissued_at":"2026-05-17T23:38:50.810052Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.810052Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"An 8B multimodal model outperforms GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using far less memory and inference time.","cross_cats":["cs.CV"],"primary_cat":"cs.LG","authors_text":"Bingxiang He, Bokai Xu, Chi Chen, Chongyi Wang, Fuwei Huang, Ganqu Cui, Guoyang Zeng, Hanyu Liu, Hongyuan Liu, Jie Cai, Jie Zhou, Jingkun Tang, Ji Qi, Junbo Cui, Liqing Ruan, Luoyuan Zhang, Maosong Sun, Ning Ding, Qining Guo, Tianchi Cai, Tianyu Yu, Weize Chen, Wenhao Hu, Wenshuo Ma, Xu Han, Yingjing Xu, Yuanqian Zhao, Yuan Yao, Yuxiang Huang, Yuxuan Li, Zefan Wang, Zhihui He, Zhiyuan Liu, Zonghao Guo","submitted_at":"2025-09-16T19:41:48Z","abstract_excerpt":"Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for docum"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MiniCPM-V 4.5 surpasses GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using 46.7% GPU memory and 8.7% inference time of Qwen2.5-VL 7B on VideoMME.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The reported benchmarks and efficiency measurements generalize beyond the specific evaluation suites and hardware configurations used in the experiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"An 8B multimodal model outperforms GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using far less memory and inference time.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2ceb82c42b15357dbfb0a7f6377d0f0009d2bdd9b2a8f365a98d4d58bc6bfdc6"},"source":{"id":"2509.18154","kind":"arxiv","version":1},"verdict":{"id":"47412f23-1495-4992-8839-3dbedf72224e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T17:01:57.476728Z","strongest_claim":"MiniCPM-V 4.5 surpasses GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using 46.7% GPU memory and 8.7% inference time of Qwen2.5-VL 7B on VideoMME.","one_line_summary":"An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The reported benchmarks and efficiency measurements generalize beyond the specific evaluation suites and hardware configurations used in the experiments.","pith_extraction_headline":"An 8B multimodal model outperforms GPT-4o-latest and Qwen2.5-VL 72B on OpenCompass while using far less memory and inference time."},"references":{"count":77,"sample":[{"doi":"","year":2025,"title":"Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025","work_id":"286e20cc-8045-4f8f-9d57-8a1a3db3a56a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fan","work_id":"603eef21-7c9d-450f-92a4-3e490fcf8b55","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Mimo-vl technical report, 2025","work_id":"0cf78c2c-df81-4b1a-afb1-e5fe9fa2e920","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Openai platform chatgpt-4o, 2025","work_id":"32c7e11a-916a-4e6f-9d1e-b5f9f2663738","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond","work_id":"b5c6eee0-36d1-4d4a-8258-f4ef8294b341","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":77,"snapshot_sha256":"71ca9b7fc257e7f71722b9db510a680febebcec982d40ba96ff6dace6e64cddd","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2509.18154","created_at":"2026-05-17T23:38:50.810125+00:00"},{"alias_kind":"arxiv_version","alias_value":"2509.18154v1","created_at":"2026-05-17T23:38:50.810125+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2509.18154","created_at":"2026-05-17T23:38:50.810125+00:00"},{"alias_kind":"pith_short_12","alias_value":"42WE65LJACTH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"42WE65LJACTHJRNR","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"42WE65LJ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":32,"internal_anchor_count":32,"sample":[{"citing_arxiv_id":"2605.22185","citing_title":"Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04802","citing_title":"VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21411","citing_title":"RoadTones: Tone Controllable Text Generation from Road Event Videos","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16416","citing_title":"CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17093","citing_title":"HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18740","citing_title":"Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18984","citing_title":"Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19522","citing_title":"iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20035","citing_title":"Stage-adaptive Token Selection for Efficient Omni-modal LLMs","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22186","citing_title":"MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21863","citing_title":"Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07026","citing_title":"Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2601.10611","citing_title":"Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding","ref_index":176,"is_internal_anchor":true},{"citing_arxiv_id":"2603.03944","citing_title":"SCP: Spatial Causal Prediction in Video","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13073","citing_title":"OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14664","citing_title":"MiVE: Multiscale Vision-language features for reference-guided video Editing","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13277","citing_title":"Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03318","citing_title":"EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03231","citing_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11960","citing_title":"Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10079","citing_title":"SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03903","citing_title":"CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24339","citing_title":"See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05955","citing_title":"TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22884","citing_title":"Can Multimodal Large Language Models Truly Understand Small Objects?","ref_index":51,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL","json":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL.json","graph_json":"https://pith.science/api/pith-number/42WE65LJACTHJRNRQPEDHHNNVL/graph.json","events_json":"https://pith.science/api/pith-number/42WE65LJACTHJRNRQPEDHHNNVL/events.json","paper":"https://pith.science/paper/42WE65LJ"},"agent_actions":{"view_html":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL","download_json":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL.json","view_paper":"https://pith.science/paper/42WE65LJ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2509.18154&json=true","fetch_graph":"https://pith.science/api/pith-number/42WE65LJACTHJRNRQPEDHHNNVL/graph.json","fetch_events":"https://pith.science/api/pith-number/42WE65LJACTHJRNRQPEDHHNNVL/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL/action/timestamp_anchor","attest_storage":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL/action/storage_attestation","attest_author":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL/action/author_attestation","sign_citation":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL/action/citation_signature","submit_replication":"https://pith.science/pith/42WE65LJACTHJRNRQPEDHHNNVL/action/replication_record"}},"created_at":"2026-05-17T23:38:50.810125+00:00","updated_at":"2026-05-17T23:38:50.810125+00:00"}