{"paper":{"title":"Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bo Li, Fanyi Pu, Kairui Hu, Penghao Wu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Ziwei Liu","submitted_at":"2025-01-23T16:51:47Z","abstract_excerpt":"Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. 
Video-MMMU fea"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 300 videos and 900 human-annotated questions accurately and unbiasedly capture the three cognitive stages of knowledge acquisition without selection or annotation artifacts affecting the measured gaps.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a19b6db4967651cc8c9efef686b380bf0ced55a794e317c72d1007133b445769"},"source":{"id":"2501.13826","kind":"arxiv","version":1},"verdict":{"id":"9beedc01-02fe-4c6d-96d7-57f99f0ea6c7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T00:27:45.966463Z","strongest_claim":"Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from 
videos.","one_line_summary":"Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 300 videos and 900 human-annotated questions accurately and unbiasedly capture the three cognitive stages of knowledge acquisition without selection or annotation artifacts affecting the measured gaps.","pith_extraction_headline":"Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation."},"references":{"count":62,"sample":[{"doi":"","year":null,"title":"Anthropic. Claude Team. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/claude/sonnet","work_id":"440e81fa-cd1f-404b-8d37-c4d8c28910e4","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"A systematic classification of knowledge, reasoning, and context within the ARC dataset","work_id":"f14b7cda-005a-43d5-8a6f-8cfeaeac4480","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Temporalbench: Towards fine-grained temporal understanding for multimodal video models","work_id":"0f787776-4151-41f7-8d50-6174eb1340d3","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Auroracap: Efficient, performant video detailed captioning and a new benchmark","work_id":"3c75ba15-49f8-4ea3-87a8-6c357f825176","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering",
"work_id":"a448c544-28f6-42bc-a57f-0f0e879ea3b2","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":62,"snapshot_sha256":"2005f98eaf47fc1226002149ae12f04ca5952540dcbcdf5e76223f3363541d2e","internal_anchors":14},"formal_canon":{"evidence_count":1,"snapshot_sha256":"a2c7abe510327495ce8d16c5310ff305c8b3b59ccdd3bce6704668fcfcdc9d17"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}