Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Chitta Baral; Krishna Singh Rajput; Tejas Anvekar; Vivek Gupta

arxiv: 2505.20816 · v2 · submitted 2025-05-27 · 💻 cs.CL

Pith reviewed 2026-05-19 13:44 UTC · model grok-4.3

keywords: multimodal question answering · multi-agent systems · visual language models · cross-modal reasoning · information synthesis · large language models · interpretability

The pith

A multi-agent system with two specialized VLM agents and one LLM agent outperforms single-model baselines on multimodal QA benchmarks involving text, tables, and images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MAMMQA, a framework that splits multimodal question answering into distinct stages handled by separate agents rather than one unified model. The first VLM breaks the query into sub-questions and pulls partial answers from each available modality in sequence. The second VLM then combines and refines those pieces through cross-modal reasoning before an LLM produces the final cohesive response. This modular setup aims to improve both accuracy and the ability to trace how the answer was built. Experiments across several benchmarks show the approach beats existing methods in performance and stability.

Core claim

The central claim is that decomposing multimodal QA into a cooperative pipeline—where one VLM handles query decomposition and modality-specific retrieval, a second VLM performs cross-modal synthesis and refinement, and an LLM integrates the results—yields higher accuracy and robustness than approaches that rely on a single generalized reasoning strategy.

What carries the argument

The MAMMQA multi-agent pipeline, in which the first VLM sequentially retrieves partial answers from text, tables, and images, the second VLM synthesizes them via cross-modal reasoning, and the LLM produces the final answer.
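The staged hand-off described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Model` callable type, the `Trace` container, the prompt strings, and the `make_stub` helper are all hypothetical, evidence is represented as plain strings for simplicity, and query decomposition is folded into the per-modality prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical interface: any callable that maps a prompt to a response.
Model = Callable[[str], str]


@dataclass
class Trace:
    """Intermediate outputs of each stage, kept so they can be inspected."""
    partial_answers: Dict[str, str] = field(default_factory=dict)
    synthesis: str = ""
    final_answer: str = ""


def run_pipeline(question: str, context: Dict[str, str],
                 vlm_expert: Model, vlm_synth: Model, llm_agg: Model) -> Trace:
    trace = Trace()
    # Stage 1: the first VLM visits each available modality in sequence.
    for modality in ("text", "table", "image"):
        if modality in context:
            prompt = f"[{modality}] Question: {question}\nEvidence: {context[modality]}"
            trace.partial_answers[modality] = vlm_expert(prompt)
    # Stage 2: the second VLM combines the partial answers across modalities.
    joined = "\n".join(f"{m}: {a}" for m, a in trace.partial_answers.items())
    trace.synthesis = vlm_synth(f"Question: {question}\nPartial answers:\n{joined}")
    # Stage 3: the LLM produces the final answer from the synthesis.
    trace.final_answer = llm_agg(f"Question: {question}\nSynthesis: {trace.synthesis}")
    return trace


def make_stub(tag: str) -> Model:
    """Stand-in model for demonstration; a real system would call a VLM or LLM."""
    return lambda prompt: f"{tag}:{len(prompt)}"
```

Because every stage writes to `Trace`, a caller can compare `partial_answers` with `synthesis` to see where an error entered, which is the traceability property the pipeline is meant to provide.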

If this is right

Reasoning steps become explicit and traceable because each agent produces an intermediate output that can be inspected.
Performance gains appear consistently across diverse multimodal QA benchmarks that mix text, tables, and images.
The system gains robustness because errors in one stage can be isolated rather than propagating through a single model.
Each agent stays within a narrower domain, allowing it to apply reasoning suited to its modality or task.
Individual components can be debugged or upgraded independently without retraining the entire system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged separation could be applied to multimodal tasks beyond question answering, such as captioning or retrieval.
Similar agent divisions might reduce hallucinations in other settings where synthesis across sources is required.
Testing the framework with additional specialized agents for subtasks like table parsing could reveal further gains.
The explicit pipeline may make it easier to incorporate human feedback at specific stages rather than only at the end.

Load-bearing premise

The second VLM can reliably synthesize and refine partial answers through cross-modal reasoning without introducing new errors or losing information correctly retrieved by the first agent.

What would settle it

A controlled test in which partial answers from the first VLM are correct but the second VLM's synthesized output contains added errors or omissions that reduce final accuracy below a single-agent baseline.
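One way to score such a test is sketched below, under stated assumptions: each evaluation example has already been labelled with whether the first agent's partial answers were correct, whether the full pipeline's final answer was correct, and whether a single-agent baseline was correct. The `Record` type and the `settle` function are hypothetical names introduced for illustration, and the sketch attributes every final error on the filtered subset to the stages after retrieval, which includes the aggregator as well as the synthesizer.

```python
from typing import Dict, Iterable, NamedTuple, Union


class Record(NamedTuple):
    partials_correct: bool   # stage-1 partial answers judged correct
    pipeline_correct: bool   # final answer of the full pipeline judged correct
    baseline_correct: bool   # single-agent baseline judged correct


def settle(records: Iterable[Record]) -> Dict[str, Union[int, float, bool]]:
    # Keep only cases where retrieval succeeded, so later stages are isolated.
    subset = [r for r in records if r.partials_correct]
    if not subset:
        raise ValueError("no examples with correct partial answers")
    n = len(subset)
    pipeline_acc = sum(r.pipeline_correct for r in subset) / n
    baseline_acc = sum(r.baseline_correct for r in subset) / n
    return {
        "n": n,
        "pipeline_acc": pipeline_acc,
        "baseline_acc": baseline_acc,
        # Share of correctly retrieved cases that later stages got wrong.
        "post_retrieval_error_rate": 1 - pipeline_acc,
        # The premise fails if the pipeline falls below the baseline here.
        "premise_fails": pipeline_acc < baseline_acc,
    }


toy = [
    Record(True, True, True),
    Record(True, True, False),
    Record(True, True, True),
    Record(True, False, False),
    Record(False, False, True),  # excluded: retrieval itself was wrong
]
result = settle(toy)
```

On the toy data the filtered subset has four examples, the pipeline is right on three and the baseline on two, so `premise_fails` is `False`.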

Figures

Figures reproduced from arXiv: 2505.20816 by Chitta Baral, Krishna Singh Rajput, Tejas Anvekar, Vivek Gupta.

**Figure 1.** Illustration of the proposed MAMMQA, with three agents: 1) Modality Expert, which extracts modality-specific insights; 2) Cross-Modal Synthesis Agent, which synchronises information across modalities using insights from the Modality Expert; 3) Aggregator Agent, which grounds the answer using the extracted cross-modal information.

**Figure 2.** Aggregator Agent performance with and with…

Original abstract

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The multi-agent split with one VLM decomposing and retrieving per modality, a second doing cross-modal synthesis, and an LLM integrating is a straightforward modular idea, but the outperformance claim sits on no visible numbers or ablations.

The paper puts forward MAMMQA, a three-agent setup for multimodal QA over text, tables, and images. One VLM breaks the query into sub-questions and pulls partial answers sequentially from each modality. A second VLM then synthesizes and refines those pieces through cross-modal reasoning. An LLM finally combines everything into a single answer. The goal is better accuracy plus clearer reasoning steps than a single model handling all modalities at once.

Referee Report

3 major / 2 minor

Summary. The paper proposes MAMMQA, a multi-agent framework for multimodal question answering over text, tables, and images. It deploys two VLM agents and one LLM agent: the first VLM decomposes the query into sub-questions and sequentially retrieves partial answers from each modality; the second VLM performs cross-modal synthesis and refinement; the LLM then integrates the results into a final answer. The authors argue that the modular design improves interpretability and that experiments on diverse multimodal QA benchmarks show consistent gains in accuracy and robustness over existing baselines.

Significance. If the performance claims are substantiated with proper controls and ablations, the work could advance multimodal QA by demonstrating that explicit decomposition and cross-modal synthesis steps can yield more robust and interpretable results than single-model approaches.

major comments (3)

[Abstract] The headline claim that the cooperative multi-agent framework 'consistently outperforms existing baselines in both accuracy and robustness' is asserted without any quantitative metrics, baseline names, dataset details, or statistical tests, leaving the central empirical contribution without visible supporting evidence in the manuscript.
[Framework description (Section 3)] The second VLM agent's synthesis and refinement step is presented without explicit mechanisms, guardrails, or error-tracing procedures to prevent information loss or introduction of new hallucinations, yet the overall performance claim depends on this step reliably improving the partial answers produced by the first agent.
[Experiments section] No ablation studies, component-wise error analysis, or comparison isolating the contribution of the synthesis step versus the decomposition step are described, which is required to substantiate that the multi-agent design (rather than other factors) drives the reported gains.

minor comments (2)

[Title] The title is missing punctuation between 'Answering' and 'A Multi-Agent Perspective'.
[Framework description (Section 3)] A high-level diagram or pseudocode of the agent interaction protocol would improve clarity of the modular design.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will incorporate to strengthen the manuscript.

Referee: [Abstract] The headline claim that the cooperative multi-agent framework 'consistently outperforms existing baselines in both accuracy and robustness' is asserted without any quantitative metrics, baseline names, dataset details, or statistical tests, leaving the central empirical contribution without visible supporting evidence in the manuscript.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support for the performance claims. In the revised version, we will add a concise summary of key results (e.g., accuracy gains on specific benchmarks such as MultiModalQA and ManyModalQA, with baseline names and dataset details) while preserving the abstract's brevity; full metrics, tables, and any statistical tests will remain in the experiments section.
revision: yes

Referee: [Framework description (Section 3)] The second VLM agent's synthesis and refinement step is presented without explicit mechanisms, guardrails, or error-tracing procedures to prevent information loss or introduction of new hallucinations, yet the overall performance claim depends on this step reliably improving the partial answers produced by the first agent.

Authors: We acknowledge that the current description of the second VLM's cross-modal synthesis is high-level. We will expand Section 3 with explicit details on the prompting strategies, cross-modal reasoning steps, and any built-in verification or refinement procedures used to reduce hallucinations and information loss, including illustrative examples of the synthesis process.
revision: yes

Referee: [Experiments section] No ablation studies, component-wise error analysis, or comparison isolating the contribution of the synthesis step versus the decomposition step are described, which is required to substantiate that the multi-agent design (rather than other factors) drives the reported gains.

Authors: We agree that ablations are necessary to isolate the contributions of decomposition and synthesis. We will add a new subsection in the experiments with component-wise ablations (e.g., full framework vs. decomposition-only and synthesis-only variants) and error analysis to demonstrate where the multi-agent design yields improvements over single-model baselines.
revision: yes

Circularity Check

0 steps flagged

No significant circularity: architecture described and validated empirically on external benchmarks

The paper presents MAMMQA as a modular multi-agent system with explicit roles for two VLMs and one LLM, where the first agent decomposes queries and retrieves per-modality answers, the second synthesizes via cross-modal reasoning, and the LLM produces the final output. This is a design proposal rather than a derivation from equations or fitted parameters. Performance claims rest on experiments across diverse multimodal QA benchmarks showing gains over baselines, which are external to the paper's own inputs. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The framework's interpretability and domain-expertise claims follow directly from the stated division of labor among agents without reducing to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unverified premise that assigning narrow roles to separate VLM and LLM agents produces better synthesis than a single generalized model; this is treated as a domain assumption without supporting derivation or prior results shown in the abstract.

axioms (1)

domain assumption: Specialized agents operating on individual modalities followed by cross-modal synthesis will outperform a single generalized reasoning strategy.
This premise is invoked to justify the three-agent division of labor and is presented as the key improvement over existing methods.

invented entities (1)

MAMMQA multi-agent framework (no independent evidence)
purpose: To decompose queries, retrieve partial answers per modality, synthesize via cross-modal reasoning, and integrate final answers.
Newly introduced system architecture whose superiority is asserted but not independently evidenced in the abstract.

pith-pipeline@v0.9.0 · 5715 in / 1399 out tokens · 68429 ms · 2026-05-19T13:44:53.222333+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 5 internal anchors

[1]

In arXiv preprint arXiv:2309.13007

Reconcile: Round-table conference improves reasoning via consensus among diverse llms. In arXiv preprint arXiv:2309.13007. Wenhu Chen, Ming wei Chang, Eva Schlinger, William Wang, and William Cohen

work page arXiv
[2]

In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1026–1036, Online

HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1026–1036, Online. Association for Computational Linguistics. Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke...

work page 2020
[3]

Darryl Hannan, Akshay Jain, and Mohit Bansal

Binding language models in symbolic languages. Preprint, arXiv:2210.02875. Darryl Hannan, Akshay Jain, and Mohit Bansal

work page arXiv
[4]

Haohao Luo, Ying Shen, and Yang Deng

Manymodalqa: Modality disambiguation and qa over diverse inputs. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7879–7886. Haohao Luo, Ying Shen, and Yang Deng. 2023a. Unifying text, tables, and images for multimodal question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8203–

work page 2023
[5]

Haohao Luo, Ying Shen, and Yang Deng. 2023b. Unifying text, tables, and images for multimodal question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9355–9367, Singapore. Association for Computational Linguistics. OpenAI, :, and Aaron Hurst et. al

work page 2023
[6]

GPT-4o System Card

Gpt-4o system card. Preprint, arXiv:2410.21276. Haritz Puerto, Gözde Şahin, and Iryna Gurevych

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Qwen2.5 Technical Report

Qwen2.5 technical report. Preprint, arXiv:2412.15115. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu

work page internal anchor Pith review Pith/arXiv arXiv
[8]

In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2026–2039, Albuquerque, New Mexico

UniRAG: Universal retrieval augmentation for large vision language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2026–2039, Albuquerque, New Mexico. Association for Computational Linguistics. Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, and ...

work page 2025
[9]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Preprint, arXiv:2403.05530. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903. Qian Yang, Qian Chen, Wen Wang, Baotian Hu, and Min Zhang. 2023a. Enhancing multi-modal multi-hop question answering via structured knowledge and unified retrieval-generation. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23,...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Bowen Yu, Cheng Fu, Haiyang Yu, Fei Huang, and Yongbin Li

Turning tables: Generating examples from semi-structured tables for endowing language models with reasoning skills. CoRR, abs/2107.07261. Bowen Yu, Cheng Fu, Haiyang Yu, Fei Huang, and Yongbin Li

work page arXiv
[12]

Preprint, arXiv:2306.16762

Unified language representation for question answering over text, tables, and images. Preprint, arXiv:2306.16762. Qing Zhang, Haocheng Lv, Jie Liu, Zhiyun Chen, Jianyong Duan, Hao Wang, Li He, and Mingying Xu

work page arXiv
[13]

Multimodal Chain-of-Thought Reasoning in Language Models

Multimodal chain-of-thought reasoning in language models. In arXiv preprint arXiv:2302.00923. Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, and Sibei Yang

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua

Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. arXiv preprint arXiv:2310.16436. Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua

work page arXiv
[15]

TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance.arXiv preprint arXiv:2105.07624, 2021

Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance. Preprint, arXiv:2105.07624.

A Additional Experimental Results. This section reports supplementary experiments extending our main evaluation. These include ablations analyzing model modularity, efficiency, and robustness. Ablation on Synthesizer and Single Expe...

work page arXiv
[16]

text shuffle

| Model | MultiModalQA | Single Expert | ManyModalQA | Single Expert |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 76.37 | 72.17 | 89.90 | 85.70 |
| Gemini 8B | 65.84 | 61.64 | 87.91 | 83.71 |
| Qwen2.5-VL-3B | 67.56 | 53.36 | 87.61 | 63.41 |

Table 6: Ablation analysis on the Synthesizer and single-expert variants. Removing the Synthesizer or unifying experts both reduce accuracy by 4-25 points, demonstrating the necessity of ...

work page 2023
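
The "4-25 points" range quoted in the Table 6 excerpt above can be checked with simple arithmetic. The sketch below copies the six accuracy pairs from that excerpt; reading the first value of each pair as the comparison configuration and the second as the "Single Expert" variant is an assumption, since the excerpt's header is flattened, and the names `ROWS` and `drops` are illustrative.

```python
# Accuracy pairs copied from the Table 6 excerpt.
# Assumed column roles: (comparison configuration, "Single Expert" variant).
ROWS = {
    "Qwen2.5-VL-7B": {"MultiModalQA": (76.37, 72.17), "ManyModalQA": (89.90, 85.70)},
    "Gemini 8B": {"MultiModalQA": (65.84, 61.64), "ManyModalQA": (87.91, 83.71)},
    "Qwen2.5-VL-3B": {"MultiModalQA": (67.56, 53.36), "ManyModalQA": (87.61, 63.41)},
}


def drops(rows):
    """Return the accuracy drop in points for every (model, dataset) pair."""
    return {
        (model, dataset): round(first - second, 2)
        for model, datasets in rows.items()
        for dataset, (first, second) in datasets.items()
    }


table_drops = drops(ROWS)
smallest = min(table_drops.values())
largest = max(table_drops.values())
```

Under that reading, the drops run from 4.2 to 24.2 points, which matches the quoted range; four of the six pairs differ by exactly 4.2 points, and the two larger drops both belong to Qwen2.5-VL-3B.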
