{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:WQLLX4OMUVD4PAVZ72W5IY65L3","short_pith_number":"pith:WQLLX4OM","schema_version":"1.0","canonical_sha256":"b416bbf1cca547c782b9feadd463dd5ee863fd5f73a381dd348d67f0b449ab90","source":{"kind":"arxiv","id":"2311.16502","version":4},"attestation_state":"computed","paper":{"title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions.","cross_cats":["cs.AI","cs.CV"],"primary_cat":"cs.CL","authors_text":"Botao Yu, Boyuan Zheng, Cong Wei, Dongfu Jiang, Ge Zhang, Huan Sun, Kai Zhang, Ming Yin, Renliang Sun, Ruibin Yuan, Ruoqi Liu, Samuel Stevens, Tianyu Zheng, Weiming Ren, Wenhao Huang, Wenhu Chen, Xiang Yue, Yibo Liu, Yuansheng Ni, Yu Su, Yuxuan Sun, Zhenzhu Yang","submitted_at":"2023-11-27T17:33:21Z","abstract_excerpt":"We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. U"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2311.16502","kind":"arxiv","version":4},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2023-11-27T17:33:21Z","cross_cats_sorted":["cs.AI","cs.CV"],"title_canon_sha256":"c676d155268c4b0c7a75a3b5e40ee86f50174544ced223da0e78878e44a7ea68","abstract_canon_sha256":"de0ecfa23bacf26dab6973c29b09c6078f8e05cd01f66e073e06de1205925749"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.375690Z","signature_b64":"9RGGfUz1OyYMsQwkdLOCdbgvGZQEpnFDf4p1t8ndwZ+5SeOMnrXJ0H1sLt/Ww7BNhB2a2ovgNoUJs6lwADA8DQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b416bbf1cca547c782b9feadd463dd5ee863fd5f73a381dd348d67f0b449ab90","last_reissued_at":"2026-05-17T23:38:53.375011Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.375011Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions.","cross_cats":["cs.AI","cs.CV"],"primary_cat":"cs.CL","authors_text":"Botao Yu, Boyuan Zheng, Cong Wei, Dongfu Jiang, Ge Zhang, Huan Sun, Kai Zhang, Ming Yin, Renliang Sun, Ruibin Yuan, Ruoqi Liu, Samuel Stevens, Tianyu Zheng, Weiming Ren, Wenhao Huang, Wenhu Chen, Xiang Yue, Yibo Liu, Yuansheng Ni, Yu Su, Yuxuan Sun, Zhenzhu Yang","submitted_at":"2023-11-27T17:33:21Z","abstract_excerpt":"We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. U"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The collected questions and images accurately represent the perception and reasoning demands of college-level expertise across the six disciplines.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"db82426af49414413a3d226e5a137afc0db3f808f6d3fbc011059136fbc29bde"},"source":{"id":"2311.16502","kind":"arxiv","version":4},"verdict":{"id":"58753d9a-79c9-4414-b8ba-70dd7202ea1f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:32:50.610184Z","strongest_claim":"Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement.","one_line_summary":"MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The collected questions and images accurately represent the perception and reasoning demands of college-level expertise across the six disciplines.","pith_extraction_headline":"Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions."},"references":{"count":97,"sample":[{"doi":"","year":2023,"title":"Artificial general intelligence is already here","work_id":"ee91164c-c115-42cf-b76e-7bfc03f754de","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning","work_id":"482765e0-05ff-4ec7-b839-d279de7d5d65","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"Lawrence Zitnick, and Devi Parikh","work_id":"4ed9a186-58bd-448e-8d3e-f388ddffa45d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","work_id":"87bfa84a-e663-4165-806f-93ef439d88d0","ref_index":4,"cited_arxiv_id":"2308.01390","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":5,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":97,"snapshot_sha256":"69e920abf0bf85b9da808524ccac4492e07a91e2b06c5897ee526bbd97ace56b","internal_anchors":34},"formal_canon":{"evidence_count":3,"snapshot_sha256":"216551743014b356989930d42f52f907b6419d05685b64948fdbc37d5218c014"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2311.16502","created_at":"2026-05-17T23:38:53.375123+00:00"},{"alias_kind":"arxiv_version","alias_value":"2311.16502v4","created_at":"2026-05-17T23:38:53.375123+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2311.16502","created_at":"2026-05-17T23:38:53.375123+00:00"},{"alias_kind":"pith_short_12","alias_value":"WQLLX4OMUVD4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"WQLLX4OMUVD4PAVZ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"WQLLX4OM","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":49,"internal_anchor_count":49,"sample":[{"citing_arxiv_id":"2402.11684","citing_title":"ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models","ref_index":135,"is_internal_anchor":true},{"citing_arxiv_id":"2408.10872","citing_title":"V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2410.14702","citing_title":"Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2502.13923","citing_title":"Qwen2.5-VL Technical Report","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2504.09925","citing_title":"FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18141","citing_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23678","citing_title":"Grounded Reinforcement Learning for Visual Reasoning","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2508.06038","citing_title":"Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18141","citing_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2508.19652","citing_title":"Self-Rewarding Vision-Language Model via Reasoning Decomposition","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2402.14804","citing_title":"Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2511.15578","citing_title":"AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2511.14998","citing_title":"FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2407.03320","citing_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","ref_index":167,"is_internal_anchor":true},{"citing_arxiv_id":"2401.16420","citing_title":"InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21678","citing_title":"Agentic Learner with Grow-and-Refine Multimodal Semantic Memory","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2403.14624","citing_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2406.09411","citing_title":"MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2407.01284","citing_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2512.15567","citing_title":"Evaluating Large Language Models in Scientific Discovery","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13232","citing_title":"PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2403.09611","citing_title":"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training","ref_index":128,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":130,"is_internal_anchor":true},{"citing_arxiv_id":"2404.12390","citing_title":"BLINK: Multimodal Large Language Models Can See but Not Perceive","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2401.01614","citing_title":"GPT-4V(ision) is a Generalist Web Agent, if Grounded","ref_index":26,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3","json":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3.json","graph_json":"https://pith.science/api/pith-number/WQLLX4OMUVD4PAVZ72W5IY65L3/graph.json","events_json":"https://pith.science/api/pith-number/WQLLX4OMUVD4PAVZ72W5IY65L3/events.json","paper":"https://pith.science/paper/WQLLX4OM"},"agent_actions":{"view_html":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3","download_json":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3.json","view_paper":"https://pith.science/paper/WQLLX4OM","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2311.16502&json=true","fetch_graph":"https://pith.science/api/pith-number/WQLLX4OMUVD4PAVZ72W5IY65L3/graph.json","fetch_events":"https://pith.science/api/pith-number/WQLLX4OMUVD4PAVZ72W5IY65L3/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3/action/timestamp_anchor","attest_storage":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3/action/storage_attestation","attest_author":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3/action/author_attestation","sign_citation":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3/action/citation_signature","submit_replication":"https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3/action/replication_record"}},"created_at":"2026-05-17T23:38:53.375123+00:00","updated_at":"2026-05-17T23:38:53.375123+00:00"}