M³-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering
Pith reviewed 2026-05-07 17:04 UTC · model grok-4.3
The pith
The M³-VQA benchmark shows that current multimodal models struggle with questions that link multiple entities across images and text through multi-hop reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
M³-VQA is a knowledge-based VQA benchmark featuring diverse multi-entity questions that require models to perform both sequential and parallel multi-hop reasoning across multiple documents drawn from visual and textual sources, backed by a curated multimodal knowledge base and detailed traceable evidence. Evaluation of 16 leading MLLMs under three input conditions shows poor results without external knowledge, marked improvement when gold evidence is supplied, and additional gains when reasoning-aware agentic retrieval is used instead of heuristic methods.
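A minimal sketch of how the three input conditions could be wired around an arbitrary model; `query_mllm`, `retrieve_evidence`, and the record fields are hypothetical stand-ins that mirror the benchmark description, not its actual harness or schema.

```python
# Sketch of the three evaluation conditions: no external knowledge,
# gold evidence, and retrieval-augmented input. `query_mllm` and
# `retrieve_evidence` are hypothetical stand-ins for a model API and a
# retriever; the record fields are assumptions, not the released schema.
from typing import Callable

def evaluate(records: list[dict],
             query_mllm: Callable[[list[str], str, str], str],
             retrieve_evidence: Callable[[dict], list[str]]) -> dict[str, float]:
    """Return exact-match accuracy under each of the three input conditions."""
    correct = {"closed_book": 0, "gold_evidence": 0, "retrieval": 0}
    for rec in records:
        images, question, answer = rec["images"], rec["question"], rec["answer"]
        contexts = {
            "closed_book": "",                                 # no external knowledge
            "gold_evidence": "\n".join(rec["gold_evidence"]),  # oracle documents
            "retrieval": "\n".join(retrieve_evidence(rec)),    # retrieved documents
        }
        for setting, context in contexts.items():
            prediction = query_mllm(images, question, context)
            correct[setting] += int(prediction.strip().lower() == answer.strip().lower())
    return {setting: hits / len(records) for setting, hits in correct.items()}
```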
What carries the argument
The M³-VQA benchmark of multi-entity multi-hop questions paired with a curated multimodal knowledge base that supplies traceable evidence for each required reasoning step.
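One plausible way to represent the traceable, per-hop evidence that carries the argument; the record layout below is an illustrative assumption made for this sketch, not the released data format.

```python
# Illustrative layout for a multi-entity, multi-hop question with traceable
# per-hop evidence. Field names are assumptions made for this sketch, not the
# dataset's actual schema.
from dataclasses import dataclass, field

@dataclass
class Hop:
    step: int                 # position in the reasoning chain
    entity: str               # entity this hop resolves (e.g. a bridge entity)
    evidence_doc_id: str      # KB document that supports the hop
    intermediate_answer: str  # what a correct model should conclude here

@dataclass
class M3VQARecord:
    question: str
    images: list[str]         # paths or URLs of the query images
    answer: str
    hops: list[Hop] = field(default_factory=list)

    def gold_evidence(self) -> list[str]:
        """Document IDs needed to answer, in reasoning order."""
        return [h.evidence_doc_id for h in sorted(self.hops, key=lambda h: h.step)]
```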
If this is right
- MLLMs need stronger built-in mechanisms for acquiring and integrating knowledge from both visual and textual sources before they can handle multi-entity questions reliably.
- Retrieval systems that follow explicit reasoning steps outperform simple heuristic search on these tasks.
- Performance gaps narrow sharply once precise evidence is supplied, pointing to acquisition rather than pure reasoning as the primary bottleneck.
- Future model development can use the benchmark's traceable evidence structure to diagnose exactly where multi-hop chains break down.
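If per-hop evidence is exposed in a layout like the one sketched above, locating where a chain breaks reduces to probing the model hop by hop; a hedged sketch, again assuming a hypothetical `query_mllm` interface and probing prompt.

```python
# Sketch of hop-level failure localization: give the model one hop's evidence
# at a time and check its intermediate conclusion, reporting the first hop
# that breaks. `query_mllm` and the probing prompt are hypothetical.
from typing import Callable, Optional

def first_broken_hop(record: "M3VQARecord",
                     query_mllm: Callable[[list[str], str, str], str],
                     kb: dict[str, str]) -> Optional[int]:
    """Return the step index of the first failed hop, or None if all hops hold."""
    for hop in sorted(record.hops, key=lambda h: h.step):
        context = kb[hop.evidence_doc_id]              # evidence for this hop only
        prompt = f"{record.question}\nFirst resolve: {hop.entity}"
        prediction = query_mllm(record.images, prompt, context)
        if hop.intermediate_answer.lower() not in prediction.lower():
            return hop.step                            # the chain breaks here
    return None
```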
Where Pith is reading between the lines
- Training regimes that reward explicit step-by-step chaining over multimodal sources may close the observed performance gap faster than scale alone.
- The same multi-entity multi-hop structure could be adapted to video or audio domains to test whether the same acquisition and reasoning weaknesses appear there.
- If models trained on M³-VQA-style data later generalize to open-ended real-world queries, it would suggest the benchmark captures transferable reasoning skills rather than narrow test artifacts.
Load-bearing premise
That the questions and knowledge base created for the benchmark accurately reflect the fine-grained entity understanding and complex reasoning demands of real-world multimodal tasks without adding artificial simplifications or biases.
What would settle it
If leading MLLMs achieve high accuracy on the full M³-VQA test set when given only the image and question, with no external documents or retrieval, the claim of significant gaps in knowledge acquisition and multi-hop reasoning would be undermined.
read the original abstract
We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning-aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M$^3$-VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at https://github.com/CASIA-IVA-Lab/M3VQA.
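The abstract names sequential and parallel multi-hop reasoning without illustrating the distinction; the example below is our own hedged reading, with invented questions and hop structures that are not drawn from the dataset.

```python
# Invented examples contrasting the two reasoning modes named in the abstract.

# Sequential: each hop depends on the previous one (a bridge-entity chain).
#   "Where was the architect of the building in the photo born?"
sequential_hops = [
    ("image", "building"),        # hop 1: identify the building
    ("building", "architect"),    # hop 2: building -> architect (bridge entity)
    ("architect", "birthplace"),  # hop 3: architect -> birthplace
]

# Parallel: independent chains over distinct entities, merged at the end.
#   "Which of the two animals in the picture has the longer lifespan?"
parallel_branches = {
    "left":  [("image", "species_left"), ("species_left", "lifespan_left")],
    "right": [("image", "species_right"), ("species_right", "lifespan_right")],
}
merge_step = ("compare", ["lifespan_left", "lifespan_right"])
```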
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces M³-VQA, a knowledge-based VQA benchmark targeting fine-grained multimodal entity understanding and complex multi-entity, multi-hop reasoning over visual and textual sources. It constructs a curated multimodal KB with traceable evidence, generates questions requiring sequential and parallel multi-hop inference, and evaluates 16 MLLMs across three settings: no external knowledge, gold evidence, and retrieval-augmented input. Key findings are that models perform poorly without external information, improve substantially with precise evidence, and benefit further from reasoning-aware agentic retrieval over heuristic methods.
Significance. If the questions are validated to require genuine multi-entity multi-hop integration rather than shortcuts, the benchmark would usefully expose limitations in current MLLMs' knowledge acquisition and complex multimodal reasoning, complementing existing VQA datasets. The public release of the dataset and code supports reproducibility and community follow-up work.
major comments (2)
- [§4] §4 (Experiments and Evaluation Settings): The central claim that poor results without external information demonstrate 'significant challenges for MLLMs in knowledge acquisition and complex reasoning' rests on the assumption that questions enforce multi-entity multi-hop reasoning. However, no ablation results are reported in which only partial evidence (one entity or one hop) is provided; without these controls, the marked gains under gold evidence could reflect direct lookup rather than enforced cross-entity integration.
- [§3] §3 (Benchmark Construction): The description of question generation and the multimodal KB does not include quantitative validation (e.g., distribution of required reasoning hops, inter-annotator agreement on whether single-entity shortcuts suffice, or human performance baselines on partial-evidence variants). This leaves the 'multi-hop' and 'multi-entity' properties of the benchmark under-supported for interpreting model failures.
minor comments (3)
- [§1] The abstract and §1 refer to 'sequential and parallel multi-hop reasoning' without a concrete example or diagram distinguishing the two modes; adding one would improve clarity.
- [§4] The table reporting the 16 models' results should include per-setting breakdowns and statistical significance tests for the claimed improvements under gold evidence vs. retrieval.
- [§2] The related-work section should more explicitly contrast M³-VQA against prior multi-hop VQA benchmarks (e.g., those using text-only or single-image sources) to highlight its multimodal multi-entity novelty.
Simulated Author's Rebuttal
We are grateful to the referee for the insightful feedback on our paper introducing M³-VQA. The comments highlight important aspects for validating the benchmark's complexity. We provide point-by-point responses to the major comments and commit to revisions that include the suggested ablations and quantitative analyses.
read point-by-point responses
-
Referee: [§4] §4 (Experiments and Evaluation Settings): The central claim that poor results without external information demonstrate 'significant challenges for MLLMs in knowledge acquisition and complex reasoning' rests on the assumption that questions enforce multi-entity multi-hop reasoning. However, no ablation results are reported in which only partial evidence (one entity or one hop) is provided; without these controls, the marked gains under gold evidence could reflect direct lookup rather than enforced cross-entity integration.
Authors: We agree that ablations with partial evidence are crucial to validate that the performance improvements stem from multi-entity multi-hop reasoning rather than simpler lookups. In the revised manuscript, we will add new experiments providing models with only one entity's information or one reasoning hop. These controls will demonstrate the necessity of full integration for solving the questions, thereby reinforcing our claims about the challenges in complex multimodal reasoning. revision: yes
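A minimal sketch of how such partial-evidence conditions might be built from each record's gold evidence, reusing the illustrative record layout sketched earlier; this is our reading of the proposed ablation, not the authors' implementation.

```python
# Sketch of the partial-evidence ablation: build reduced evidence sets
# (first hop only, first entity only) alongside the full gold chain. Record
# fields reuse the illustrative schema from earlier and are assumptions.

def partial_evidence_conditions(record: "M3VQARecord",
                                kb: dict[str, str]) -> dict[str, list[str]]:
    """Return evidence subsets for the ablation settings."""
    ordered = sorted(record.hops, key=lambda h: h.step)
    full = [kb[h.evidence_doc_id] for h in ordered]
    one_hop = full[:1]                                  # only the first reasoning hop
    first_entity = ordered[0].entity
    one_entity = [kb[h.evidence_doc_id] for h in ordered if h.entity == first_entity]
    return {"gold_full": full, "one_hop": one_hop, "one_entity": one_entity}
```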
-
Referee: [§3] §3 (Benchmark Construction): The description of question generation and the multimodal KB does not include quantitative validation (e.g., distribution of required reasoning hops, inter-annotator agreement on whether single-entity shortcuts suffice, or human performance baselines on partial-evidence variants). This leaves the 'multi-hop' and 'multi-entity' properties of the benchmark under-supported for interpreting model failures.
Authors: We acknowledge that explicit quantitative validation would provide stronger evidence for the benchmark's properties. We will revise the manuscript to include: statistics on the distribution of reasoning hops and entities per question; inter-annotator agreement metrics assessing whether questions can be solved via single-entity shortcuts; and human performance results on partial-evidence variants. These additions will be based on further annotation and evaluation, directly addressing the concern. revision: yes
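The promised hop and entity statistics are straightforward to compute if records expose a per-hop structure like the one sketched earlier; a hedged sketch under that assumption.

```python
# Sketch of the promised dataset statistics: how many reasoning hops and how
# many distinct entities each question requires. Assumes the illustrative
# per-hop record layout sketched earlier.
from collections import Counter

def reasoning_statistics(records: list["M3VQARecord"]) -> dict[str, Counter]:
    """Distributions of hops per question and distinct entities per question."""
    hops_per_question = Counter(len(r.hops) for r in records)
    entities_per_question = Counter(len({h.entity for h in r.hops}) for r in records)
    return {"hops_per_question": hops_per_question,
            "entities_per_question": entities_per_question}
```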
Circularity Check
No circularity: benchmark construction and model evaluation are independent of self-referential fitting or derivations.
full rationale
The paper introduces a new VQA benchmark with curated multimodal KB, multi-entity questions, and three evaluation settings (no external knowledge, gold evidence, retrieval). It reports empirical results on 16 MLLMs without any parameter fitting, predictive modeling, or derivation steps that reduce to the inputs by construction. Claims about model challenges rest on observed performance gaps, not on equations or self-citations that enforce the outcome. No load-bearing self-citation chains or ansatzes are present; the work is self-contained as an external evaluation benchmark.