Pith Number

pith:WQLLX4OM

pith:2023:WQLLX4OMUVD4PAVZ72W5IY65L3
not attested · not anchored · not stored · refs resolved

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Botao Yu, Boyuan Zheng, Cong Wei, Dongfu Jiang, Ge Zhang, Huan Sun, Kai Zhang, Ming Yin, Renliang Sun, Ruibin Yuan, Ruoqi Liu, Samuel Stevens, Tianyu Zheng, Weiming Ren, Wenhao Huang, Wenhu Chen, Xiang Yue, Yibo Liu, Yuansheng Ni, Yu Su, Yuxuan Sun, Zhenzhu Yang

Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions.

arxiv:2311.16502 v4 · 2023-11-27 · cs.CL · cs.AI · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{WQLLX4OMUVD4PAVZ72W5IY65L3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1: strongest claim

Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement.

C2: weakest assumption

The collected questions and images accurately represent the perception and reasoning demands of college-level expertise across the six disciplines.

C3: one-line summary

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

References

97 extracted · 97 resolved · 34 Pith anchors

Formal links

3 machine-checked theorem links

Cited by

49 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:53.375011Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

b416bbf1cca547c782b9feadd463dd5ee863fd5f73a381dd348d67f0b449ab90

Aliases

arxiv: 2311.16502 · arxiv_version: 2311.16502v4 · doi: 10.48550/arxiv.2311.16502 · pith_short_12: WQLLX4OMUVD4 · pith_short_16: WQLLX4OMUVD4PAVZ · pith_short_8: WQLLX4OM
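
The short aliases listed here are prefixes of the full Pith Number, and the Pith Number itself matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. This is an observation drawn from the values on this page, not a documented derivation rule, and the function names below are hypothetical. A minimal Python sketch under that assumption:

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "b416bbf1cca547c782b9feadd463dd5ee863fd5f73a381dd348d67f0b449ab90"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters.

    Assumption: the 26-character length is inferred from this record only.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith_number: str) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


full = pith_number_from_hash(CANONICAL_HASH)
print(full)  # WQLLX4OMUVD4PAVZ72W5IY65L3
for name, value in short_aliases(full).items():
    print(name, value)
```

If the assumption holds for other records, a client could check that a Pith Number and its canonical hash belong together without contacting the server.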
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WQLLX4OMUVD4PAVZ72W5IY65L3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b416bbf1cca547c782b9feadd463dd5ee863fd5f73a381dd348d67f0b449ab90
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "de0ecfa23bacf26dab6973c29b09c6078f8e05cd01f66e073e06de1205925749",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CV"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-11-27T17:33:21Z",
    "title_canon_sha256": "c676d155268c4b0c7a75a3b5e40ee86f50174544ced223da0e78878e44a7ea68"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.16502",
    "kind": "arxiv",
    "version": 4
  }
}
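
The canonicalization step in the verification command above can also be run offline against a local copy of this record. The sketch below mirrors the `python3 -c` stage of that command: keys sorted, compact separators, non-ASCII characters left unescaped, UTF-8 encoded, then SHA-256. The name `canonical_sha256` is hypothetical, and whether the resulting digest equals the canonical hash shown on this page has not been checked here; compare the printed value yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize a record deterministically and return its SHA-256 hex digest."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as-is
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Local copy of the canonical record JSON shown above.
record = {
    "metadata": {
        "abstract_canon_sha256": "de0ecfa23bacf26dab6973c29b09c6078f8e05cd01f66e073e06de1205925749",
        "cross_cats_sorted": ["cs.AI", "cs.CV"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-11-27T17:33:21Z",
        "title_canon_sha256": "c676d155268c4b0c7a75a3b5e40ee86f50174544ced223da0e78878e44a7ea68",
    },
    "schema_version": "1.0",
    "source": {"id": "2311.16502", "kind": "arxiv", "version": 4},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because the keys are sorted before hashing, two mirrors that hold the same record with different key order should compute the same digest.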