Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

Jing Liu; Longteng Guo; Pengkang Huo; Tailai Chen; Xinxin Zhu; Yifan Wang; Yuze Wu

arxiv: 2605.25364 · v1 · pith:4JY2ISWNnew · submitted 2026-05-25 · 💻 cs.CV

Pith reviewed 2026-06-29 22:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language models · vision-centric reasoning · visual reasoning benchmark · MLLM evaluation · perceptual reasoning · structural reasoning · conceptual reasoning

The pith

VisReason benchmark shows current MLLMs fall short on reasoning that requires direct visual evidence over language shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VisReason as a benchmark of 1,505 questions in 10 categories that tests reasoning where visual perception and inference must work together in everyday scenes. Evaluation results indicate that existing MLLMs show large performance gaps compared with humans and gain little from test-time reasoning techniques such as chain-of-thought. A sympathetic reader would care because the benchmark is designed to block solutions based on textual priors, revealing whether reported visual-reasoning gains are genuine or illusory. The work positions VisReason as a diagnostic tool rather than another general VQA set, focusing on perceptual, structural, and conceptual categories that couple seeing and thinking.

Core claim

VisReason is a benchmark for vision-centric reasoning in which perception and inference are tightly coupled, containing 1,505 questions across 10 categories that span perceptual, structural, and conceptual levels. The evaluation demonstrates that this benchmark poses a qualitatively different challenge from prior suites, with current MLLMs exhibiting substantial gaps relative to human performance and showing only limited benefits from test-time reasoning strategies.

What carries the argument

The VisReason benchmark itself, whose 1,505 questions and 10 categories are constructed to require reasoning grounded in visual evidence rather than solvable through language shortcuts.

If this is right

MLLMs do not yet perform vision-centric reasoning at human levels when language shortcuts are removed.
Test-time reasoning strategies such as chain-of-thought yield only marginal improvements on these tasks.
Existing visual-reasoning benchmarks likely overestimate model capabilities by permitting solutions based on textual priors.
Progress on vision-centric reasoning will require training or evaluation methods that enforce tighter coupling between perception and inference.
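
The claim that chain-of-thought yields only marginal improvements is a paired comparison: the same questions are answered with and without the strategy. A minimal sketch of one way to test such a difference is an exact McNemar test on per-question correctness. This is a swapped-in standard technique, not a procedure the paper is known to use, and the function name `mcnemar_exact` and the input layout (two equal-length lists of booleans) are assumptions made for illustration.

```python
from math import comb


def mcnemar_exact(direct, cot):
    """Two-sided exact McNemar test on paired per-question correctness.

    `direct` and `cot` are hypothetical lists of booleans, one entry per
    question, for direct prompting and chain-of-thought prompting.
    Returns (gained, lost, p_value).
    """
    if len(direct) != len(cot):
        raise ValueError("paired lists must have equal length")
    # Only discordant pairs carry information about a difference.
    gained = sum(1 for d, c in zip(direct, cot) if not d and c)
    lost = sum(1 for d, c in zip(direct, cot) if d and not c)
    n = gained + lost
    if n == 0:
        return gained, lost, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    k = min(gained, lost)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return gained, lost, min(1.0, 2 * tail)
```

A small p-value with `gained` only slightly above `lost` would indicate a real but marginal effect, so the counts are worth reporting alongside the p-value.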

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Developers may need to redesign training objectives to reward direct use of visual features rather than language-model fallback.
VisReason could serve as a recurring diagnostic to measure whether new architectures close the human-model gap on coupled perception-inference tasks.
Similar benchmark designs might expose comparable gaps in other modalities such as video or spatial reasoning where language cues are also abundant.

Load-bearing premise

The questions succeed in forcing reasoning that depends on visual evidence and cannot be answered from language patterns or prior textual knowledge alone.

What would settle it

A controlled experiment in which an MLLM reaches human-level accuracy on the VisReason questions while test-time reasoning methods produce large score gains, yet the same model still solves the questions when visual input is removed or corrupted.
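
The last condition, whether a model still answers correctly once the image is removed or corrupted, can be checked with a simple ablation. The sketch below assumes each question has been scored twice, once with the intact image and once without it. The `Outcome` record, its field names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One question scored twice (hypothetical record layout)."""
    question_id: str
    correct_with_image: bool
    correct_without_image: bool


def image_dependence(outcomes):
    """Summarise how much accuracy depends on seeing the image."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    acc_full = sum(o.correct_with_image for o in outcomes) / n
    acc_blind = sum(o.correct_without_image for o in outcomes) / n
    return {
        "acc_with_image": acc_full,
        "acc_without_image": acc_blind,
        # A drop near zero would suggest the image is not needed.
        "drop": acc_full - acc_blind,
    }


# Toy data, invented for illustration only.
toy = [
    Outcome("q1", True, False),
    Outcome("q2", True, True),
    Outcome("q3", False, False),
    Outcome("q4", True, False),
]
summary = image_dependence(toy)
```

The accuracy without the image should also be compared with the chance rate for the answer format, since multiple-choice questions can be guessed correctly without any visual input.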

Figures

Figures reproduced from arXiv: 2605.25364 by Jing Liu, Longteng Guo, Pengkang Huo, Tailai Chen, Xinxin Zhu, Yifan Wang, Yuze Wu.

**Figure 2.** VisReason examples from different reasoning categories, covering everyday visual reasoning scenarios

**Figure 3.** Domain diversity in VisReason. The benchmark covers 10 reasoning categories and 36 fine-grained…

**Figure 4.** Scaling behavior of models on VisReason.

**Figure 5.** Effect of COT prompting on GPT-4o perfor…

**Figure 7.** Distribution of error types for Qwen3-VL…

**Figure 8.** Distribution of questions across the 10 reasoning categories in VisReason. The dataset is designed to span a wide range of vision-centric reasoning phenomena, covering perceptual grounding, structured spatial reasoning, and higher-level inference. Conceptual reasoning categories capture abstract inference based on visual evidence, while structural categories focus on spatial relations, rule-base…

**Figure 9.** Token length and accuracy across categories for different models.

**Figure 10.** An example from the Spot the Difference category generated by our automatic update framework. Left: Original image. Right: Generated image with one grape removed. • Sudoku Solving. This category can be updated with an automatic puzzle generator. • Cue Insight, Inductive Reasoning, and Deductive Reasoning. These categories can be updated from civil service exams and visual puzzle websites with model-assi…

**Figure 11.** OCR Error case of Qwen3-VL-235B-A22B-Thinking.

**Figure 12.** Counting Error case of Qwen3-VL-235B-A22B-Thinking.

**Figure 13.** Visual Parsing Error case of Qwen3-VL-235B-A22B-Thinking.

**Figure 14.** Spatial Orientation Error case of Qwen3-VL-235B-A22B-Thinking.

**Figure 15.** Grounding Error case of Qwen3-VL-235B-A22B-Thinking.

**Figure 16.** Wrong Rule case of Qwen3-VL-235B-A22B-Thinking.

**Figure 17.** Missed Steps case of Qwen3-VL-235B-A22B-Thinking.

**Figure 18.** Calculation Error case of Qwen3-VL-235B-A22B-Thinking.

**Figure 19.** Format Error case of Qwen3-VL-235B-A22B-Thinking.

**Figure 20.** Hallucination case of Qwen3-VL-235B-A22B-Thinking.

Original abstract

Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VisReason is a new benchmark trying to isolate vision-grounded reasoning, but its core claim needs text-only controls to stand up.

The main point of this paper is that VisReason puts forward a new set of 1,505 questions in 10 categories meant to test reasoning that actually depends on visual input in everyday scenes, rather than on language patterns or priors. It reports gaps with humans and limited gains from test-time methods.

The work does a reasonable job of naming a practical problem in current MLLM benchmarks, where models can often succeed without using the image. The focus on tightly coupled perception-inference tasks and the split into perceptual, structural, and conceptual categories is a clear attempt to make the evaluation more diagnostic.

The soft spot is exactly the one in the stress-test note. Nothing in the abstract shows text-only baselines, human text-only accuracy, or explicit checks that the questions cannot be solved from the question text alone. Without those, the performance gaps could still come from language shortcuts, which weakens the assertion that this is a qualitatively different challenge. The abstract gives no details on question construction or vetting either.

The paper engages directly with the existing literature on MLLM limitations and presents concrete numbers and categories. That is enough to make it worth a look for people working on multimodal evaluation.

I would bring this to a reading group as a maybe, mainly to discuss benchmark design choices. It is not something I would cite soon because of the missing controls. It deserves peer review because a properly validated benchmark in this area would be useful, even if the current version needs work on the language-leakage issue.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces VisReason, a benchmark containing 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning in everyday scenarios. It claims that current MLLMs exhibit substantial gaps relative to humans on this benchmark and show limited gains from test-time reasoning strategies, positioning VisReason as a diagnostic for vision-centric reasoning that existing benchmarks fail to capture.

Significance. If the benchmark construction succeeds in isolating reasoning that requires visual evidence rather than textual inference or priors, the work would supply a useful targeted evaluation set for diagnosing integration failures in MLLMs and for measuring progress beyond language-only shortcuts.

major comments (2)

[Abstract / Benchmark Design] Abstract and benchmark construction: the assertion that VisReason poses a 'qualitatively different challenge' and that performance gaps reflect vision-centric reasoning deficits rests on the unverified premise that the 1,505 questions cannot be solved from text alone. No text-only LLM baselines, human text-only accuracy, or explicit controls for prior-knowledge leakage are reported, directly undermining the central claim.
[Evaluation] Evaluation section: the abstract states that evaluation reveals gaps and limited test-time gains, yet the provided text supplies no methods, data splits, statistical details, full result tables, or per-category breakdowns. Without these, the reported human-MLLM gaps and strategy comparisons cannot be assessed for soundness.
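
A per-category breakdown of the kind requested above can be sketched as follows. The example reports both micro accuracy (pooled over questions) and macro accuracy (averaged over categories), which can diverge when categories differ in size. The function name, the `(category, is_correct)` input layout, and the toy counts are assumptions for illustration. The category names are borrowed from the figure captions, but the outcomes attached to them are invented.

```python
from collections import defaultdict


def category_breakdown(records):
    """Per-category, micro, and macro accuracy.

    `records` is a hypothetical iterable of (category, is_correct) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(is_correct)
    if not totals:
        raise ValueError("need at least one record")
    per_category = {c: hits[c] / totals[c] for c in totals}
    # Micro: every question counts equally.
    micro = sum(hits.values()) / sum(totals.values())
    # Macro: every category counts equally, regardless of size.
    macro = sum(per_category.values()) / len(per_category)
    return per_category, micro, macro


# Toy outcomes, invented for illustration only.
toy_records = (
    [("Sudoku Solving", True)] * 1 + [("Sudoku Solving", False)] * 3
    + [("Spot the Difference", True)] * 3 + [("Spot the Difference", False)] * 1
    + [("Deductive Reasoning", True)] * 2
)
per_category, micro, macro = category_breakdown(toy_records)
```

Reporting both figures guards against a single large category dominating the headline number.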

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of benchmark validation and evaluation transparency. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract / Benchmark Design] Abstract and benchmark construction: the assertion that VisReason poses a 'qualitatively different challenge' and that performance gaps reflect vision-centric reasoning deficits rests on the unverified premise that the 1,505 questions cannot be solved from text alone. No text-only LLM baselines, human text-only accuracy, or explicit controls for prior-knowledge leakage are reported, directly undermining the central claim.

Authors: We agree that explicit verification is necessary to support the claim that VisReason isolates vision-centric reasoning. In the revised manuscript, we will add text-only LLM baselines using multiple models on question text alone, report human accuracy on text-only versions of the questions, and include analysis for prior-knowledge leakage. These additions will directly test whether the questions can be solved without visual input. revision: yes
Referee: [Evaluation] Evaluation section: the abstract states that evaluation reveals gaps and limited test-time gains, yet the provided text supplies no methods, data splits, statistical details, full result tables, or per-category breakdowns. Without these, the reported human-MLLM gaps and strategy comparisons cannot be assessed for soundness.

Authors: We acknowledge that the submitted version lacked sufficient detail in the evaluation section. The revised manuscript will expand this section to include full methodological descriptions, data split information, statistical details (including confidence intervals and significance tests), complete result tables, and per-category performance breakdowns for both MLLMs and humans, enabling readers to fully assess the reported gaps and strategy comparisons. revision: yes
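
As an illustration of the confidence intervals mentioned in this response, the sketch below computes a Wilson score interval for an accuracy figure. It is one common choice for a binomial proportion and is not necessarily the method the authors would adopt. The function name, the default `z = 1.96` (roughly a 95% interval), and the example count of 600 correct answers are assumptions. Only the total of 1,505 questions comes from the paper.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (accuracy)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical score: 600 correct out of the benchmark's 1,505 questions.
low, high = wilson_interval(600, 1505)
```

Unlike the simpler normal approximation, this interval stays within [0, 1] and behaves sensibly for small per-category sample sizes.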

Circularity Check

0 steps flagged

No circularity: benchmark dataset construction with no derivations or self-referential reductions

full rationale

This is a benchmark introduction paper whose central contribution is the creation and initial evaluation of the VisReason question set (1,505 questions, 10 categories). No equations, parameter fitting, or derivation chain exists in the provided text. The claim that the benchmark poses a 'qualitatively different challenge' rests on the authors' design choices for question content rather than any reduction to prior fitted values or self-citations. The skeptic concern about missing text-only baselines is a validity issue, not a circularity issue. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's central addition is the benchmark itself; its validity rests on the domain assumption that the chosen questions isolate vision-centric reasoning.

axioms (1)

domain assumption: The 10 categories and 1,505 questions isolate reasoning grounded in visual evidence from language-based shortcuts.
This premise underpins the claim that VisReason poses a qualitatively different challenge.
