Pith Number

pith:4O7HQCVO

pith:2024:4O7HQCVOBKXK7AXX7NVR7JHMEM
not attested · not anchored · not stored · refs resolved

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

Chaoyou Fu, Feng Li, Haochen Tian, Huanyu Zhang, Junfei Wu, Kun Wang, Liang Wang, Qingsong Wen, Rong Jin, Shuangqing Zhang, Tieniu Tan, Yi-Fan Zhang, Zhang Zhang

Even the strongest multimodal LLMs fail to reach 60 percent accuracy on high-resolution real-world tasks

arxiv:2408.13257 v3 · 2024-08-23 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{4O7HQCVOBKXK7AXX7NVR7JHMEM}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

even the most advanced models struggle with our benchmarks, where none of them reach 60% accuracy

C2 weakest assumption

The 13,366 filtered images and 29,429 QA pairs created by 25 annotators and 7 experts truly represent high-resolution real-world scenarios that are extremely challenging even for humans

C3 one line summary

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

References

102 extracted · 102 resolved · 30 Pith anchors

[1] NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study 2017
[2] PaLM 2 Technical Report 2023 · arXiv:2305.10403
[3] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models 2023 · arXiv:2308.01390
[4] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966
[5] TouchStone: Evaluating vision-language models by language models 2023

Formal links

1 machine-checked theorem link

Cited by

37 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:48.584764Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

e3be780aae0aaeaf82f7fb6b1fa4ec232acef96c97153915e475c39bf8505b35

Aliases

arxiv: 2408.13257 · arxiv_version: 2408.13257v3 · doi: 10.48550/arxiv.2408.13257 · pith_short_12: 4O7HQCVOBKXK · pith_short_16: 4O7HQCVOBKXK7AXX · pith_short_8: 4O7HQCVO
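The identifiers on this page appear to be related in a simple way: Base32-encoding the first 16 bytes of the canonical hash and dropping the padding yields the 26-character Pith Number, and the pith_short_8, pith_short_12 and pith_short_16 aliases are prefixes of it. The sketch below checks this against the values shown here. The derivation rule is an assumption inferred from this single record, not something the page states, and the function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "e3be780aae0aaeaf82f7fb6b1fa4ec232acef96c97153915e475c39bf8505b35"
PITH_NUMBER = "4O7HQCVOBKXK7AXX7NVR7JHMEM"


def pith_number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of the digest and strip '=' padding.

    Assumption: this rule is inferred from the values on this page only.
    """
    prefix = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(prefix).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Return the 8-, 12- and 16-character prefixes listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


derived = pith_number_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)   # does the inferred rule reproduce the number?
print(short_aliases(derived))
```

If the rule holds in general, a mirror could recompute the identifier from the canonical record alone, without trusting the page that displays it.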
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/4O7HQCVOBKXK7AXX7NVR7JHMEM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e3be780aae0aaeaf82f7fb6b1fa4ec232acef96c97153915e475c39bf8505b35
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "5d3bbb8b38c0d16507887f6e562134f837670dcf21be01c1811979ce43518d33",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-08-23T17:59:51Z",
    "title_canon_sha256": "b579ae444d9a4bd2c060336112ea901912da858435643d1dae227d8eabb9fa89"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2408.13257",
    "kind": "arxiv",
    "version": 3
  }
}
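The verification one-liner above can also be run offline against the record JSON. The following is a minimal sketch that uses the same canonicalization as that command (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes, then SHA-256). The helper names are hypothetical, and whether the record as displayed here is byte-for-byte what the API returns is an assumption, so the final comparison is printed rather than taken for granted.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "5d3bbb8b38c0d16507887f6e562134f837670dcf21be01c1811979ce43518d33",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-08-23T17:59:51Z",
    "title_canon_sha256": "b579ae444d9a4bd2c060336112ea901912da858435643d1dae227d8eabb9fa89"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2408.13257",
    "kind": "arxiv",
    "version": 3
  }
}
"""
EXPECTED = "e3be780aae0aaeaf82f7fb6b1fa4ec232acef96c97153915e475c39bf8505b35"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no whitespace, as in the one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or emits the fields.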