{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:JO2LRJNGHJ24KLDBJPAXPJVIWT","short_pith_number":"pith:JO2LRJNG","schema_version":"1.0","canonical_sha256":"4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44","source":{"kind":"arxiv","id":"2311.17005","version":4},"attestation_state":"computed","paper":{"title":"MVBench: A Comprehensive Multi-modal Video Understanding Benchmark","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Most multi-modal AI models fail at temporal understanding in videos, but a new benchmark and training method lift performance by more than 15 percent.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Guo Chen, Jilan Xu, Kunchang Li, Limin Wang, Ping Luo, Yali Wang, Yi Liu, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao, Zun Wang","submitted_at":"2023-11-28T17:59:04Z","abstract_excerpt":"With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. 
Specifically, we first introduce a nove"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2311.17005","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2023-11-28T17:59:04Z","cross_cats_sorted":[],"title_canon_sha256":"e9370fc3ea60975756dc5270470f9babbffcc54e5543ad4c496f5afa89d98774","abstract_canon_sha256":"daea546185557c858111e16434b39fef7760fcb87d5ec7c76147f5c613d518f3"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.190159Z","signature_b64":"pL8CNvDceL9+Mq+wYzIEq0Skm50ReAZ3tP9jvyL7p8GbiUS+Qt6vrCFlozCLg2mIapGSCGlqnHNFBm3Q20rYCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"4bb4b8a5a63a75c52c614bc177a6a8b4f2ef75e3d32702851aad90c82f4dce44","last_reissued_at":"2026-05-17T23:38:13.189427Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.189427Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MVBench: A Comprehensive Multi-modal Video Understanding Benchmark","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Most multi-modal AI models fail at temporal understanding in videos, but a new benchmark and training method lift performance by more than 15 percent.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Guo Chen, Jilan Xu, Kunchang Li, Limin Wang, Ping Luo, Yali Wang, Yi Liu, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao, Zun Wang","submitted_at":"2023-11-28T17:59:04Z","abstract_excerpt":"With the rapid 
development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a nove"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That automatically converting public video annotations into multiple-choice QA pairs accurately measures the intended temporal skills without introducing annotation biases or allowing single-frame shortcuts.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Most multi-modal AI models fail at temporal understanding in videos, but a new benchmark and training method lift performance by more than 15 
percent.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"cf27bad3fd342403d7194c15b35af731d434653f4e74bdcddfd284fdeb256219"},"source":{"id":"2311.17005","kind":"arxiv","version":4},"verdict":{"id":"0422e78a-9fbd-4176-be0e-57845c776662","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T20:18:29.561336Z","strongest_claim":"the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench.","one_line_summary":"MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That automatically converting public video annotations into multiple-choice QA pairs accurately measures the intended temporal skills without introducing annotation biases or allowing single-frame shortcuts.","pith_extraction_headline":"Most multi-modal AI models fail at temporal understanding in videos, but a new benchmark and training method lift performance by more than 15 percent."},"references":{"count":104,"sample":[{"doi":"","year":2022,"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","ref_index":1,"cited_arxiv_id":"2204.14198","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":2,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2021,"title":"Frozen in time: A joint video and image encoder for end-to-end retrieval","work_id":"12562377-293a-4224-b83e-3f411bc1cd94","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Ali Furkan Biten, Rubèn Pérez Tito, Andrés Mafla, Lluís Gómez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 20","work_id":"1b934981-5385-4e63-b823-9601505710bd","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Language models are few-shot learners","work_id":"82a86dd0-f3b8-4511-97b1-c7b263281e6e","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":104,"snapshot_sha256":"5450571d31559e0a4e997ba4d8202006403e66cc39b1edf8e2dcf00090e1818c","internal_anchors":24},"formal_canon":{"evidence_count":2,"snapshot_sha256":"86fe5896407563c1307107fb80b8f1d601edf9aa3aa1ca361ebf46b5ec3c871f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2311.17005","created_at":"2026-05-17T23:38:13.189554+00:00"},{"alias_kind":"arxiv_version","alias_value":"2311.17005v4","created_at":"2026-05-17T23:38:13.189554+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2311.17005","created_at":"2026-05-17T23:38:13.189554+00:00"},{"alias_kind":"pith_short_12","alias_value":"JO2LRJNGHJ24","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"JO2LRJNGHJ24KLDB","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"JO2LRJNG","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":16,"internal_anchor_count":16,"sample":[{"citing_arxiv_id":"2407.03320","citing_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2403.00476","citing_title":"TempCompass: Do Video LLMs Really Understand 
Videos?","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14724","citing_title":"HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2404.16994","citing_title":"PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2504.01805","citing_title":"SpaceR: Reinforcing MLLMs in Video Spatial Reasoning","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27259","citing_title":"Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2406.04264","citing_title":"MLVU: Benchmarking Multi-task Long Video Understanding","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02467","citing_title":"VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27083","citing_title":"Co-Evolving Policy Distillation","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25186","citing_title":"FCMBench-Video: Benchmarking Document Video Intelligence","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03351","citing_title":"VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08703","citing_title":"QoS-QoE Translation with Large Language Model","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08077","citing_title":"AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05225","citing_title":"MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE 
Inference","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2501.13106","citing_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16893","citing_title":"EasyVideoR1: Easier RL for Video Understanding","ref_index":19,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT","json":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT.json","graph_json":"https://pith.science/api/pith-number/JO2LRJNGHJ24KLDBJPAXPJVIWT/graph.json","events_json":"https://pith.science/api/pith-number/JO2LRJNGHJ24KLDBJPAXPJVIWT/events.json","paper":"https://pith.science/paper/JO2LRJNG"},"agent_actions":{"view_html":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT","download_json":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT.json","view_paper":"https://pith.science/paper/JO2LRJNG","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2311.17005&json=true","fetch_graph":"https://pith.science/api/pith-number/JO2LRJNGHJ24KLDBJPAXPJVIWT/graph.json","fetch_events":"https://pith.science/api/pith-number/JO2LRJNGHJ24KLDBJPAXPJVIWT/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT/action/timestamp_anchor","attest_storage":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT/action/storage_attestation","attest_author":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT/action/author_attestation","sign_citation":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT/action/citation_signature","submit_replication":"https://pith.science/pith/JO2LRJNGHJ24KLDBJPAXPJVIWT/action/replication_record"}},"created_at":"2026-05-17T23:38:13.189554+00:00","updated_at":"2026-05-17T23:38:13.189554+00:00"}