OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

2026 · cs.CV · arXiv 2604.20806

1 Pith paper cites this work. Polarity classification is still being indexed.


abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

representative citing papers

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.

citing papers explorer


StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning cs.CV · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
