M³-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering
Pith reviewed 2026-05-07 17:04 UTC · model grok-4.3
The pith
The M³-VQA benchmark shows that current multimodal models struggle with questions that link multiple entities across images and text through multi-hop reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
M³-VQA is a knowledge-based VQA benchmark featuring diverse multi-entity questions that require models to perform both sequential and parallel multi-hop reasoning across multiple documents drawn from visual and textual sources, backed by a curated multimodal knowledge base and detailed traceable evidence. Evaluation of 16 leading MLLMs under three input conditions shows poor results without external knowledge, marked improvement when gold evidence is supplied, and additional gains when reasoning-aware agentic retrieval is used instead of heuristic methods.
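A minimal sketch of how the three input conditions could be wired around an arbitrary model; `query_mllm`, `retrieve_evidence`, and the record fields are hypothetical stand-ins that mirror the benchmark description, not its actual harness or schema.

```python
# Sketch of the three evaluation conditions: no external knowledge,
# gold evidence, and retrieval-augmented input. `query_mllm` and
# `retrieve_evidence` are hypothetical stand-ins for a model API and a
# retriever; the record fields are assumptions, not the released schema.
from typing import Callable

def evaluate(records: list[dict],
             query_mllm: Callable[[list[str], str, str], str],
             retrieve_evidence: Callable[[dict], list[str]]) -> dict[str, float]:
    """Return exact-match accuracy under each of the three input conditions."""
    correct = {"closed_book": 0, "gold_evidence": 0, "retrieval": 0}
    for rec in records:
        images, question, answer = rec["images"], rec["question"], rec["answer"]
        contexts = {
            "closed_book": "",                                 # no external knowledge
            "gold_evidence": "\n".join(rec["gold_evidence"]),  # oracle documents
            "retrieval": "\n".join(retrieve_evidence(rec)),    # retrieved documents
        }
        for setting, context in contexts.items():
            prediction = query_mllm(images, question, context)
            correct[setting] += int(prediction.strip().lower() == answer.strip().lower())
    return {setting: hits / len(records) for setting, hits in correct.items()}
```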
What carries the argument
The M³-VQA benchmark of multi-entity multi-hop questions paired with a curated multimodal knowledge base that supplies traceable evidence for each required reasoning step.
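One plausible way to represent the traceable, per-hop evidence that carries the argument; the record layout below is an illustrative assumption made for this sketch, not the released data format.

```python
# Illustrative layout for a multi-entity, multi-hop question with traceable
# per-hop evidence. Field names are assumptions made for this sketch, not the
# dataset's actual schema.
from dataclasses import dataclass, field

@dataclass
class Hop:
    step: int                 # position in the reasoning chain
    entity: str               # entity this hop resolves (e.g. a bridge entity)
    evidence_doc_id: str      # KB document that supports the hop
    intermediate_answer: str  # what a correct model should conclude here

@dataclass
class M3VQARecord:
    question: str
    images: list[str]         # paths or URLs of the query images
    answer: str
    hops: list[Hop] = field(default_factory=list)

    def gold_evidence(self) -> list[str]:
        """Document IDs needed to answer, in reasoning order."""
        return [h.evidence_doc_id for h in sorted(self.hops, key=lambda h: h.step)]
```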
If this is right
- MLLMs need stronger built-in mechanisms for acquiring and integrating knowledge from both visual and textual sources before they can handle multi-entity questions reliably.
- Retrieval systems that follow explicit reasoning steps outperform simple heuristic search on these tasks.
- Performance gaps narrow sharply once precise evidence is supplied, pointing to acquisition rather than pure reasoning as the primary bottleneck.
- Future model development can use the benchmark's traceable evidence structure to diagnose exactly where multi-hop chains break down.
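If per-hop evidence is exposed in a layout like the one sketched above, locating where a chain breaks reduces to probing the model hop by hop; a hedged sketch, again assuming a hypothetical `query_mllm` interface and probing prompt.

```python
# Sketch of hop-level failure localization: give the model one hop's evidence
# at a time and check its intermediate conclusion, reporting the first hop
# that breaks. `query_mllm` and the probing prompt are hypothetical.
from typing import Callable, Optional

def first_broken_hop(record: "M3VQARecord",
                     query_mllm: Callable[[list[str], str, str], str],
                     kb: dict[str, str]) -> Optional[int]:
    """Return the step index of the first failed hop, or None if all hops hold."""
    for hop in sorted(record.hops, key=lambda h: h.step):
        context = kb[hop.evidence_doc_id]              # evidence for this hop only
        prompt = f"{record.question}\nFirst resolve: {hop.entity}"
        prediction = query_mllm(record.images, prompt, context)
        if hop.intermediate_answer.lower() not in prediction.lower():
            return hop.step                            # the chain breaks here
    return None
```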
Where Pith is reading between the lines
- Training regimes that reward explicit step-by-step chaining over multimodal sources may close the observed performance gap faster than scale alone.
- The same multi-entity multi-hop structure could be adapted to video or audio domains to test whether the same acquisition and reasoning weaknesses appear there.
- If models trained on M³-VQA-style data later generalize to open-ended real-world queries, it would suggest the benchmark captures transferable reasoning skills rather than narrow test artifacts.
Load-bearing premise
That the questions and knowledge base created for the benchmark accurately reflect the fine-grained entity understanding and complex reasoning demands of real-world multimodal tasks without adding artificial simplifications or biases.
What would settle it
If leading MLLMs achieve high accuracy on the full M³-VQA test set when given only the image and question, with no external documents or retrieval, the claim of significant gaps in knowledge acquisition and multi-hop reasoning would be undermined.
read the original abstract
We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning-aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M$^3$-VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at https://github.com/CASIA-IVA-Lab/M3VQA.
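The abstract names sequential and parallel multi-hop reasoning without illustrating the distinction; the example below is our own hedged reading, with invented questions and hop structures that are not drawn from the dataset.

```python
# Invented examples contrasting the two reasoning modes named in the abstract.

# Sequential: each hop depends on the previous one (a bridge-entity chain).
#   "Where was the architect of the building in the photo born?"
sequential_hops = [
    ("image", "building"),        # hop 1: identify the building
    ("building", "architect"),    # hop 2: building -> architect (bridge entity)
    ("architect", "birthplace"),  # hop 3: architect -> birthplace
]

# Parallel: independent chains over distinct entities, merged at the end.
#   "Which of the two animals in the picture has the longer lifespan?"
parallel_branches = {
    "left":  [("image", "species_left"), ("species_left", "lifespan_left")],
    "right": [("image", "species_right"), ("species_right", "lifespan_right")],
}
merge_step = ("compare", ["lifespan_left", "lifespan_right"])
```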
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces M³-VQA, a knowledge-based VQA benchmark targeting fine-grained multimodal entity understanding and complex multi-entity, multi-hop reasoning over visual and textual sources. It constructs a curated multimodal KB with traceable evidence, generates questions requiring sequential and parallel multi-hop inference, and evaluates 16 MLLMs across three settings: no external knowledge, gold evidence, and retrieval-augmented input. Key findings are that models perform poorly without external information, improve substantially with precise evidence, and benefit further from reasoning-aware agentic retrieval over heuristic methods.
Significance. If the questions are validated to require genuine multi-entity multi-hop integration rather than shortcuts, the benchmark would usefully expose limitations in current MLLMs' knowledge acquisition and complex multimodal reasoning, complementing existing VQA datasets. The public release of the dataset and code supports reproducibility and community follow-up work.
major comments (2)
- [§4] §4 (Experiments and Evaluation Settings): The central claim that poor results without external information demonstrate 'significant challenges for MLLMs in knowledge acquisition and complex reasoning' rests on the assumption that questions enforce multi-entity multi-hop reasoning. However, no ablation results are reported in which only partial evidence (one entity or one hop) is provided; without these controls, the marked gains under gold evidence could reflect direct lookup rather than enforced cross-entity integration.
- [§3] §3 (Benchmark Construction): The description of question generation and the multimodal KB does not include quantitative validation (e.g., distribution of required reasoning hops, inter-annotator agreement on whether single-entity shortcuts suffice, or human performance baselines on partial-evidence variants). This leaves the 'multi-hop' and 'multi-entity' properties of the benchmark under-supported for interpreting model failures.
minor comments (3)
- [§1] The abstract and §1 refer to 'sequential and parallel multi-hop reasoning' without a concrete example or diagram distinguishing the two modes; adding one would improve clarity.
- [§4] The table reporting the 16 models' results should include per-setting breakdowns and statistical significance tests for the claimed improvements under gold evidence vs. retrieval.
- [§2] The related-work section should more explicitly contrast M³-VQA against prior multi-hop VQA benchmarks (e.g., those using text-only or single-image sources) to highlight its multimodal multi-entity novelty.
Simulated Author's Rebuttal
We are grateful to the referee for the insightful feedback on our paper introducing M³-VQA. The comments highlight important aspects for validating the benchmark's complexity. We provide point-by-point responses to the major comments and commit to revisions that include the suggested ablations and quantitative analyses.
read point-by-point responses
-
Referee: [§4] §4 (Experiments and Evaluation Settings): The central claim that poor results without external information demonstrate 'significant challenges for MLLMs in knowledge acquisition and complex reasoning' rests on the assumption that questions enforce multi-entity multi-hop reasoning. However, no ablation results are reported in which only partial evidence (one entity or one hop) is provided; without these controls, the marked gains under gold evidence could reflect direct lookup rather than enforced cross-entity integration.
Authors: We agree that ablations with partial evidence are crucial to validate that the performance improvements stem from multi-entity multi-hop reasoning rather than simpler lookups. In the revised manuscript, we will add new experiments providing models with only one entity's information or one reasoning hop. These controls will demonstrate the necessity of full integration for solving the questions, thereby reinforcing our claims about the challenges in complex multimodal reasoning. revision: yes
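A minimal sketch of how such partial-evidence conditions might be built from each record's gold evidence, reusing the illustrative record layout sketched earlier; this is our reading of the proposed ablation, not the authors' implementation.

```python
# Sketch of the partial-evidence ablation: build reduced evidence sets
# (first hop only, first entity only) alongside the full gold chain. Record
# fields reuse the illustrative schema from earlier and are assumptions.

def partial_evidence_conditions(record: "M3VQARecord",
                                kb: dict[str, str]) -> dict[str, list[str]]:
    """Return evidence subsets for the ablation settings."""
    ordered = sorted(record.hops, key=lambda h: h.step)
    full = [kb[h.evidence_doc_id] for h in ordered]
    one_hop = full[:1]                                  # only the first reasoning hop
    first_entity = ordered[0].entity
    one_entity = [kb[h.evidence_doc_id] for h in ordered if h.entity == first_entity]
    return {"gold_full": full, "one_hop": one_hop, "one_entity": one_entity}
```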
-
Referee: [§3] §3 (Benchmark Construction): The description of question generation and the multimodal KB does not include quantitative validation (e.g., distribution of required reasoning hops, inter-annotator agreement on whether single-entity shortcuts suffice, or human performance baselines on partial-evidence variants). This leaves the 'multi-hop' and 'multi-entity' properties of the benchmark under-supported for interpreting model failures.
Authors: We acknowledge that explicit quantitative validation would provide stronger evidence for the benchmark's properties. We will revise the manuscript to include: statistics on the distribution of reasoning hops and entities per question; inter-annotator agreement metrics assessing whether questions can be solved via single-entity shortcuts; and human performance results on partial-evidence variants. These additions will be based on further annotation and evaluation, directly addressing the concern. revision: yes
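The promised hop and entity statistics are straightforward to compute if records expose a per-hop structure like the one sketched earlier; a hedged sketch under that assumption.

```python
# Sketch of the promised dataset statistics: how many reasoning hops and how
# many distinct entities each question requires. Assumes the illustrative
# per-hop record layout sketched earlier.
from collections import Counter

def reasoning_statistics(records: list["M3VQARecord"]) -> dict[str, Counter]:
    """Distributions of hops per question and distinct entities per question."""
    hops_per_question = Counter(len(r.hops) for r in records)
    entities_per_question = Counter(len({h.entity for h in r.hops}) for r in records)
    return {"hops_per_question": hops_per_question,
            "entities_per_question": entities_per_question}
```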
Circularity Check
No circularity: benchmark construction and model evaluation are independent of self-referential fitting or derivations.
full rationale
The paper introduces a new VQA benchmark with curated multimodal KB, multi-entity questions, and three evaluation settings (no external knowledge, gold evidence, retrieval). It reports empirical results on 16 MLLMs without any parameter fitting, predictive modeling, or derivation steps that reduce to the inputs by construction. Claims about model challenges rest on observed performance gaps, not on equations or self-citations that enforce the outcome. No load-bearing self-citation chains or ansatzes are present; the work is self-contained as an external evaluation benchmark.