{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:YEK2EB3UZILCNPFUT75SKOASPQ","short_pith_number":"pith:YEK2EB3U","schema_version":"1.0","canonical_sha256":"c115a20774ca1626bcb49ffb2538127c01322595c20ebd0fdb3baf0c12ded52d","source":{"kind":"arxiv","id":"2501.13826","version":1},"attestation_state":"computed","paper":{"title":"Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bo Li, Fanyi Pu, Kairui Hu, Penghao Wu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Ziwei Liu","submitted_at":"2025-01-23T16:51:47Z","abstract_excerpt":"Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU fea"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2501.13826","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-01-23T16:51:47Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"d04a3d1b3579b429fd81cd2b06c42dc5c53e786c742b441ef725394c06e527a3","abstract_canon_sha256":"e26b739eaab3a336fdbef524f8c1175a46f3aac3cfe83c64a58b41e5c4e16c7a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T03:19:23.485951Z","signature_b64":"fzRjTtg2DAHONF5N0v2tha90jX2a1qjgvWuprSiYRAINJIzZSneCLX869nRay7EXsA+9RQrobqkeliioRAVRAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c115a20774ca1626bcb49ffb2538127c01322595c20ebd0fdb3baf0c12ded52d","last_reissued_at":"2026-05-18T03:19:23.485360Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T03:19:23.485360Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bo Li, Fanyi Pu, Kairui Hu, Penghao Wu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Ziwei Liu","submitted_at":"2025-01-23T16:51:47Z","abstract_excerpt":"Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU fea"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 300 videos and 900 human-annotated questions accurately and unbiasedly capture the three cognitive stages of knowledge acquisition without selection or annotation artifacts affecting the measured gaps.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a19b6db4967651cc8c9efef686b380bf0ced55a794e317c72d1007133b445769"},"source":{"id":"2501.13826","kind":"arxiv","version":1},"verdict":{"id":"9beedc01-02fe-4c6d-96d7-57f99f0ea6c7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T00:27:45.966463Z","strongest_claim":"Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.","one_line_summary":"Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 300 videos and 900 human-annotated questions accurately and unbiasedly capture the three cognitive stages of knowledge acquisition without selection or annotation artifacts affecting the measured gaps.","pith_extraction_headline":"Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation."},"references":{"count":62,"sample":[{"doi":"","year":null,"title":"Anthropic. Claude Team. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/claude/sonnet ,","work_id":"440e81fa-cd1f-404b-8d37-c4d8c28910e4","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"A systematic classification of knowl- edge, reasoning, and context within the ARC dataset","work_id":"f14b7cda-005a-43d5-8a6f-8cfeaeac4480","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Temporalbench: Towards fine-grained temporal understanding for multimodal video models","work_id":"0f787776-4151-41f7-8d50-6174eb1340d3","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Auroracap: Efficient, performant video detailed captioning and a new benchmark","work_id":"3c75ba15-49f8-4ea3-87a8-6c357f825176","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answer- ing","work_id":"a448c544-28f6-42bc-a57f-0f0e879ea3b2","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":62,"snapshot_sha256":"2005f98eaf47fc1226002149ae12f04ca5952540dcbcdf5e76223f3363541d2e","internal_anchors":14},"formal_canon":{"evidence_count":1,"snapshot_sha256":"a2c7abe510327495ce8d16c5310ff305c8b3b59ccdd3bce6704668fcfcdc9d17"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2501.13826","created_at":"2026-05-18T03:19:23.485470+00:00"},{"alias_kind":"arxiv_version","alias_value":"2501.13826v1","created_at":"2026-05-18T03:19:23.485470+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2501.13826","created_at":"2026-05-18T03:19:23.485470+00:00"},{"alias_kind":"pith_short_12","alias_value":"YEK2EB3UZILC","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"YEK2EB3UZILCNPFU","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"YEK2EB3U","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":42,"internal_anchor_count":42,"sample":[{"citing_arxiv_id":"2605.06094","citing_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22907","citing_title":"VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2502.13923","citing_title":"Qwen2.5-VL Technical Report","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2511.20785","citing_title":"LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21931","citing_title":"EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22819","citing_title":"Cambrian-P: Pose-Grounded Video Understanding","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15764","citing_title":"GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17283","citing_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18162","citing_title":"Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2507.06261","citing_title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2511.04670","citing_title":"Cambrian-S: Towards Spatial Supersensing in Video","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2511.11113","citing_title":"VIDEOP2R: Video Understanding from Perception to Reasoning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2505.21374","citing_title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2511.19972","citing_title":"Boosting Reasoning in Large Multimodal Models via Activation Replay","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2512.03043","citing_title":"OneThinker: All-in-one Reasoning Model for Image and Video","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2602.11509","citing_title":"Multimodal Fact-Level Attribution for Verifiable Reasoning","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09568","citing_title":"EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2603.20633","citing_title":"Seed1.8 Model Card: Towards Generalized Real-World Agency","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12954","citing_title":"AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.01824","citing_title":"STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2503.21776","citing_title":"Video-R1: Reinforcing Video Reasoning in MLLMs","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06094","citing_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09874","citing_title":"EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding","ref_index":94,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09649","citing_title":"Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10050","citing_title":"EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs","ref_index":11,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ","json":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ.json","graph_json":"https://pith.science/api/pith-number/YEK2EB3UZILCNPFUT75SKOASPQ/graph.json","events_json":"https://pith.science/api/pith-number/YEK2EB3UZILCNPFUT75SKOASPQ/events.json","paper":"https://pith.science/paper/YEK2EB3U"},"agent_actions":{"view_html":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ","download_json":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ.json","view_paper":"https://pith.science/paper/YEK2EB3U","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2501.13826&json=true","fetch_graph":"https://pith.science/api/pith-number/YEK2EB3UZILCNPFUT75SKOASPQ/graph.json","fetch_events":"https://pith.science/api/pith-number/YEK2EB3UZILCNPFUT75SKOASPQ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ/action/storage_attestation","attest_author":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ/action/author_attestation","sign_citation":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ/action/citation_signature","submit_replication":"https://pith.science/pith/YEK2EB3UZILCNPFUT75SKOASPQ/action/replication_record"}},"created_at":"2026-05-18T03:19:23.485470+00:00","updated_at":"2026-05-18T03:19:23.485470+00:00"}