Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective

2025 · cs.CL · arXiv 2505.20816

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality and ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
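
The three-stage flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the names `Agent`, `PartialAnswer`, `run_pipeline`, and `stub_agent`, as well as the prompt strings, are hypothetical, and each agent is modeled as a plain callable from a prompt string to a reply string so that the example runs without any model backend.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: an "agent" is any callable mapping a prompt string to a reply string.
Agent = Callable[[str], str]


@dataclass
class PartialAnswer:
    modality: str       # e.g. "text", "table", or "image"
    sub_question: str   # sub-question produced for this modality
    answer: str         # partial answer retrieved from this modality


def run_pipeline(
    question: str,
    sources: Dict[str, str],
    decomposer: Agent,    # stands in for the first VLM
    synthesizer: Agent,   # stands in for the second VLM
    integrator: Agent,    # stands in for the text-only LLM
) -> str:
    # Stage 1: decompose the query and collect partial answers,
    # visiting the modalities one after another.
    partials: List[PartialAnswer] = []
    for modality, content in sources.items():
        sub_q = decomposer(f"decompose|{modality}|{question}")
        answer = decomposer(f"answer|{modality}|{sub_q}|{content}")
        partials.append(PartialAnswer(modality, sub_q, answer))

    # Stage 2: hand all partial answers to the second agent for
    # cross-modal synthesis and refinement.
    evidence = "\n".join(
        f"[{p.modality}] {p.sub_question} -> {p.answer}" for p in partials
    )
    refined = synthesizer(evidence)

    # Stage 3: the text-only agent integrates the refined evidence
    # into a single final answer.
    return integrator(f"Question: {question}\nEvidence: {refined}")


def stub_agent(tag: str) -> Agent:
    """Placeholder agent that wraps its prompt in a tag; swap in a real model call."""
    return lambda prompt: f"{tag}({prompt})"
```

With stub agents, a call such as `run_pipeline("Q", {"text": "T", "table": "B"}, stub_agent("vlm1"), stub_agent("vlm2"), stub_agent("llm"))` returns a nested string that makes the order of the stages visible, which loosely mirrors the transparency argument made in the abstract.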

representative citing papers

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

EAGLE is a new evidence-aligned framework that improves multi-agent VQA by enforcing consistency in visual grounding across agents, achieving best average performance on six benchmarks.

