Pith Number

pith:XD6Y6EP7

pith:2024:XD6Y6EP7RYKT4SMQC4CJJBPCNK
not attested · not anchored · not stored · refs resolved

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Chaowei Xiao, Chunyuan Li, Dan Roth, Fei Wang, Hoifung Poon, Hsiang-Hui Liu, James Y. Huang, Kai-Wei Chang, Kai Zhang, Mingyu Derek Ma, Muhao Chen, Nan Xu, Pan Lu, Qin Liu, Sheng Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Wenxuan Zhou, Xiaogeng Liu, Xingyu Fu, Zekun Li

MuirBench reveals that even leading multimodal LLMs like GPT-4o achieve only 68 percent accuracy on multi-image tasks.

arxiv:2406.09411 v2 · 2024-06-13 · cs.CV · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{XD6Y6EP7RYKT4SMQC4CJJBPCNK}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy.

C2 · weakest assumption

Each standard instance and its paired unanswerable variant are assumed to differ only minimally in semantics, and this pairing is assumed to isolate multi-image understanding reliably without introducing new biases or artifacts in question construction.

C3 · one line summary

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

References

72 extracted · 72 resolved · 13 Pith anchors

[1] Flamingo: a visual language model for few-shot learning 2022
[2] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models 2023 · arXiv:2308.01390
[3] Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond 2023
[4] Visual question answering on image sets 2020
[5] Language models are few-shot learners 2020

Cited by

23 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:46.018797Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

b8fd8f11ff8e153e499017049485e26ab991852fb84c3b7a9514944acb09a738

Aliases

arxiv: 2406.09411 · arxiv_version: 2406.09411v2 · doi: 10.48550/arxiv.2406.09411 · pith_short_12: XD6Y6EP7RYKT · pith_short_16: XD6Y6EP7RYKT4SMQ · pith_short_8: XD6Y6EP7
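The short aliases above are plain prefixes of the full 26-character Pith Number. For this record, the Pith Number also coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That relationship is an observation about this one record, not a documented rule of the `pith-number/v1.0` schema, and the helper name `base32_prefix` is hypothetical. A minimal Python sketch that checks both:

```python
import base64

PITH_NUMBER = "XD6Y6EP7RYKT4SMQC4CJJBPCNK"
CANONICAL_HASH = "b8fd8f11ff8e153e499017049485e26ab991852fb84c3b7a9514944acb09a738"
ALIASES = {
    "pith_short_8": "XD6Y6EP7",
    "pith_short_12": "XD6Y6EP7RYKT",
    "pith_short_16": "XD6Y6EP7RYKT4SMQ",
}


def base32_prefix(hex_digest: str, length: int) -> str:
    """Hypothetical helper: first `length` Base32 characters of a hex digest."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Each short alias is the full number truncated to the length in its name.
for name, alias in ALIASES.items():
    length = int(name.rsplit("_", 1)[1])
    assert alias == PITH_NUMBER[:length], name

# Observed for this record only (an assumption, not a documented guarantee).
print(base32_prefix(CANONICAL_HASH, len(PITH_NUMBER)) == PITH_NUMBER)
```

Because the aliases are prefixes, a shorter alias identifies fewer bits of the hash; the full canonical hash remains the value to compare when verifying a record.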
Agent API
Verify this Pith Number yourself
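# Fetch the record as JSON-LD, extract the canonical record, re-serialize it
# with sorted keys and compact separators, then print its SHA-256 digest.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/XD6Y6EP7RYKT4SMQC4CJJBPCNK \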
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b8fd8f11ff8e153e499017049485e26ab991852fb84c3b7a9514944acb09a738
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "8d5d7b9915588fe1d70c0c7a1399795ba30490968aba4c01fe00fe7d034964be",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-06-13T17:59:52Z",
    "title_canon_sha256": "7305f5e7a2333182d246e61898bc65d3ee91cf09ceb7f4a464b2086a045c3a99"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.09411",
    "kind": "arxiv",
    "version": 2
  }
}
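The verification command above can also be sketched offline in Python, using the same serialization settings as its one-liner (sorted keys, compact separators, `ensure_ascii=False`). The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON displayed here is exactly what the API returns under `canonical_record`; if it is not, the digest will differ from the published hash.

```python
import hashlib
import json

# Published canonical hash for this record, copied from the page.
EXPECTED = "b8fd8f11ff8e153e499017049485e26ab991852fb84c3b7a9514944acb09a738"

# Canonical record as displayed on the page (assumed complete and unmodified).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "8d5d7b9915588fe1d70c0c7a1399795ba30490968aba4c01fe00fe7d034964be",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-06-13T17:59:52Z",
        "title_canon_sha256": "7305f5e7a2333182d246e61898bc65d3ee91cf09ceb7f4a464b2086a045c3a99",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.09411", "kind": "arxiv", "version": 2},
}


def canonical_sha256(record: dict) -> str:
    """Hypothetical helper: SHA-256 over a key-sorted, compact JSON encoding."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the digest independent of key order and whitespace in the source JSON, which is what lets two parties hash the same record and compare results.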