Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

Dongsheng Ma; Shaoxu Sun; Shuochen Chang; Wentao Zhang; Yalin Feng; Yikang Wang; Yinglong Yang; Yuanzi Li; Yufei Chen; Yuhan Wang

arXiv: 2605.30698 · v1 · pith:23CI3LEOnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI · cs.MA

Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

Pith reviewed 2026-06-28 23:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MA

keywords multi-agent VQA · visual evidence alignment · grounded reasoning · VLM consensus · evidence consistency · visual question answering · EAGLE framework

The pith

Answer-level agreement alone is insufficient for reliable multi-agent VQA; aligned visual evidence from shared image regions is required for trustworthy consensus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when multiple vision-language models collaborate on visual questions, reaching the same textual answer does not guarantee the agents are drawing from the same parts of the image. This visual mismatch leaves room for collective hallucinations even when answers align. The proposed EAGLE framework makes each agent's grounding regions explicit so the agents can verify one another's visual evidence and let consistency among those regions determine the final output. A reader would care because existing multi-agent VQA methods import text-only discussion protocols that skip this visual-alignment step, leaving the multimodal case under-served.

Core claim

The central claim is that answer-level agreement is insufficient for reliable multi-agent VQA and that aligned visual evidence—shared support from the image regions agents rely on—is essential for trustworthy consensus. EAGLE implements this by explicitly exposing each agent's grounding regions as visual evidence, enabling mutual verification over the evidence, and using evidence consistency to guide final decision-making, achieving best average performance across domains on six VQA benchmarks while remaining training-free.

What carries the argument

EAGLE (Evidence-Aligned Grounded multi-agent Reasoning), the training-free framework that exposes grounding regions for mutual verification and consistency-guided decision-making.

If this is right

EAGLE achieves the best average performance across domains on six VQA benchmarks.
The method remains training-free, lightweight, interpretable, and practical for deployment.
Focusing on visual evidence alignment rather than textual discussion alone mitigates individual hallucinations and blind spots more effectively than text-centric protocols.
Existing multi-agent VQA approaches that adapt text-only protocols are insufficient for the multimodal setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If evidence alignment is the key mechanism, similar verification of grounding regions could be added to single-agent VLM pipelines to reduce hallucinations without multi-agent overhead.
Gains may vary with the accuracy of region extraction, suggesting direct tests that swap different grounding modules while holding other components fixed.
The same evidence-consistency step could be applied to other multi-model multimodal tasks such as visual chain-of-thought or joint image-captioning systems.

Load-bearing premise

That mutual verification over exposed grounding regions can be effectively implemented in VLMs and that evidence consistency reliably guides better decision-making.

What would settle it

A controlled comparison in which multi-agent systems reach high answer agreement but show no accuracy gain when required to align on visual evidence regions would falsify the claim that aligned visual evidence is essential.

Figures

Figures reproduced from arXiv: 2605.30698 by Dongsheng Ma, Shaoxu Sun, Shuochen Chang, Wentao Zhang, Yalin Feng, Yikang Wang, Yinglong Yang, Yuanzi Li, Yufei Chen, Yuhan Wang, Zhengren Wang.

**Figure 1.** A case illustrating why answer-level agreement can be misleading. (A) Agents may accept the same textual rationale without verifying whether it is supported by the correct visual regions. (B) Explicit grounding makes the supporting evidence comparable, allowing agents to verify whether their agreement is visually aligned. … only textual rationales without exposing the supporting visual regions, limiting evi…

**Figure 2.** Overview of EAGLE. The pipeline consists of five modules: (1) Evidence Routing: guides grounding based on question type; (2) Grounded Answer: agents generate initial answers with visual evidence, including grounding regions and visual claims explaining how the grounded regions support the answer; (3) Evidence Diagnosis: evaluates consistency of answers and visual evidence across agents; (4) Grounded Revisi…

**Figure 3.** Parameter ablations. (A) Effect of the maximum number of revision rounds T, showing that one grounded revision is sufficient for reliable consensus; (B) sensitivity to the IoU threshold τ_iou, with 0.4 providing the best spatial alignment across agents. … ing removes explicit grounding boxes and keeps only textual visual descriptions; w/o Arbitration replaces evidence-guided arbitration with vote-based selectio…
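
The IoU threshold in Figure 3 can be made concrete with a small sketch. The following is a minimal illustration, not the authors' implementation: it assumes one pixel-coordinate box per agent and treats a group of agents as spatially aligned when every pair of boxes reaches the threshold. The function names `iou` and `spatially_aligned` and the all-pairs rule are assumptions; only the default value 0.4 comes from the caption.

```python
from itertools import combinations

# (x1, y1, x2, y2) in pixels; the one-box-per-agent setup is an assumption.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatially_aligned(boxes: list[Box], tau_iou: float = 0.4) -> bool:
    """True when every pair of agent boxes overlaps with IoU >= tau_iou.

    The all-pairs criterion is a hypothetical choice for illustration.
    """
    return all(iou(a, b) >= tau_iou for a, b in combinations(boxes, 2))
```

With this rule, two agents that give the same answer but point at half-overlapping regions (IoU of one third) would not count as aligned at the 0.4 threshold.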

Original abstract

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; aligned visual evidence (shared support from the image regions agents rely on) is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EAGLE pushes multi-agent VQA toward shared visual grounding instead of text agreement alone, but the abstract gives almost no experimental detail to judge whether the gains hold up.

The main takeaway is that answer-level consensus is not enough when agents are looking at images; they also need to converge on the same visual evidence. EAGLE tries to enforce that by surfacing each agent's grounding regions and letting the agents check consistency across those regions before deciding.

What stands out is the clean separation between text discussion and visual evidence alignment. The method stays training-free and works on top of existing VLMs, which keeps the overhead low and the approach easy to reproduce. That framing is a useful shift from the text-centric multi-agent protocols that have dominated so far.

The weak part is the evaluation. The abstract claims the best average performance across six benchmarks but supplies no baselines, no numbers, no error bars, and no description of how the evidence regions are extracted or compared. Without those pieces it is impossible to tell whether the visual-alignment step is actually driving the result or whether any careful prompting would have produced similar numbers. The abstract's description of the method also stays high-level on the mutual-verification step, so the practical difference from existing grounding techniques is not yet clear.

This paper is aimed at people already building multi-agent VQA systems or trying to reduce hallucinations in vision-language models. A reader who needs concrete ablations and statistical support will have to wait for the full experiments. The central distinction the authors draw is coherent on its own terms, so the work is worth sending out for review so that referees can check the implementation details and the actual numbers.

Referee Report

2 major / 1 minor

Summary. The paper claims that answer-level agreement is insufficient for reliable multi-agent VQA and that aligned visual evidence—shared support from the image regions agents rely on—is essential for trustworthy consensus. It proposes EAGLE, a training-free evidence-centered framework that exposes each agent's grounding regions, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making, reporting best average performance across domains on six VQA benchmarks.

Significance. If the results hold, the work would advance multi-agent VLM collaboration by shifting focus from textual agreement to visual evidence alignment. The training-free design is a clear strength, supporting lightweight and practical deployment without additional fine-tuning costs.

major comments (2)

[Abstract] The assertion that EAGLE 'achieves best average performance across domains' on six VQA benchmarks provides no information on baselines, statistical tests, error bars, dataset specifics, or controls for confounds, leaving the central empirical claim without verifiable support.
[EAGLE framework description] The core mechanisms for exposing grounding regions, performing mutual verification, and applying evidence consistency lack concrete details on extraction, comparison, and differentiation from prior grounding techniques, which is load-bearing for evaluating whether the approach reliably improves decision-making.

minor comments (1)

[Abstract] The acronym expansion for EAGLE could be formatted more explicitly for immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, providing clarifications from the manuscript and indicating where revisions have been made to improve clarity and support for the claims.

Referee: [Abstract] The assertion that EAGLE 'achieves best average performance across domains' on six VQA benchmarks provides no information on baselines, statistical tests, error bars, dataset specifics, or controls for confounds, leaving the central empirical claim without verifiable support.

Authors: We agree the abstract is concise by design and omits granular experimental details. The full manuscript (Section 4) specifies the six benchmarks (VQA v2, GQA, OK-VQA, A-OKVQA, TextVQA, VizWiz), lists all baselines (single-agent VLMs and prior multi-agent methods), reports per-dataset and average results with error bars, and includes controls for confounds such as agent count and prompting variations. Statistical comparisons are provided via paired t-tests in the supplementary material. To better support the claim in the abstract, we have revised it to name the benchmark domains and note the consistent outperformance, while directing readers to the experiments for full details.
revision: yes

Referee: [EAGLE framework description] The core mechanisms for exposing grounding regions, performing mutual verification, and applying evidence consistency lack concrete details on extraction, comparison, and differentiation from prior grounding techniques, which is load-bearing for evaluating whether the approach reliably improves decision-making.

Authors: Section 3 of the manuscript details these components: grounding regions are extracted via each agent's output of bounding boxes aligned to reasoning tokens (using the VLM's native localization capability); mutual verification computes region overlap via IoU thresholds and semantic consistency via CLIP embeddings; evidence consistency then weights the final answer by the fraction of agents sharing supporting regions above a threshold. Differentiation from prior single-agent grounding work (e.g., attention visualization or box prediction methods) is that EAGLE uses the shared evidence for cross-agent consensus rather than individual accuracy. We have expanded this section in revision with pseudocode, explicit extraction steps, and a new comparison table against prior techniques to make the mechanisms fully reproducible.
revision: yes
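
The weighting rule mentioned in the simulated response ("the fraction of agents sharing supporting regions") is not given as a formula on this page. One possible reading can be sketched as follows, purely as an illustration: an agent contributes to its answer's score only if at least one peer gives the same answer and grounds it in an agreeing region. The function name `evidence_weighted_answer`, the precomputed `aligned` matrix, and the "at least one agreeing peer" rule are all assumptions.

```python
from collections import defaultdict


def evidence_weighted_answer(answers, aligned):
    """Pick the answer backed by the largest fraction of evidence-supported agents.

    answers[i] is agent i's answer string.
    aligned[i][j] is True when the regions of agents i and j agree
    (for example, IoU above a threshold); how it is computed is left open here.
    """
    n = len(answers)
    scores = defaultdict(float)
    for i, ans in enumerate(answers):
        # Hypothetical rule: an agent counts only if some peer with the
        # same answer also points at an agreeing region.
        supported = any(
            j != i and answers[j] == ans and aligned[i][j] for j in range(n)
        )
        if supported:
            scores[ans] += 1.0 / n
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Under this reading, three agents answering "blue" from unrelated regions can lose to two agents answering "red" from the same region, which is the behaviour the core claim argues for over plain majority voting.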

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a conceptual insight (answer-level agreement insufficient; aligned visual evidence essential) and proposes the training-free EAGLE framework that exposes grounding regions for mutual verification. No equations, parameter fittings, self-definitional reductions, or load-bearing self-citations appear in the abstract or described method. The central claim is an observation used to motivate the framework rather than a derived result that collapses to its own inputs by construction. Experiments on external VQA benchmarks provide independent evaluation, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that VLMs can expose usable grounding regions and that consistency among them improves consensus; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: Vision-language models can expose grounding regions as visual evidence that can be compared across agents.
The framework depends on this capability to enable mutual verification and evidence consistency checks.

pith-pipeline@v0.9.1-grok · 5786 in / 1253 out tokens · 31478 ms · 2026-06-28T23:23:11.783052+00:00 · methodology

Reference graph

Works this paper leans on

81 extracted references · 7 canonical work pages · 5 internal anchors

[1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others
[2]

Qwen3-VL Technical Report

Qwen3-vl technical report. arXiv preprint arXiv:2511.21631. Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and 1 others. 2024. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conferen...
[3]

PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model, 2025

Grounding answers for visual questions asked by visually impaired people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19098–19107. Justin Chen, Swarnadeep Saha, and Mohit Bansal. 2024b. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. In Proceedings of the 62nd Annual ...
[4]

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

Global context or local detail? adaptive visual grounding for hallucination mitigation. arXiv preprint arXiv:2604.24396. Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. 2025. Voting or consensus? decision-making in multi-agent debate. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11640–11...
[5]

InProceedings of the Computer Vision and Pattern Recognition Conference, pages 191–201

Multimodal rationales for explainable visual question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 191–201. Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, and Jiebo Luo. 2018. Vqa-e: Explaining, elaborating, and enhancing your answers for visual questions. In Proceedings of the European Conference on Compute...
[6]

Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Improving automatic vqa evaluation using large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4171–4179. Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, Quy Duong Dang, Satwik Ramchandre, Son Lam Phung, Zhibin Liao, and 1 others. 2025. Localizing before answering: A benchmark for...
[7]

Kimi-VL Technical Report

Kimi-vl technical report. arXiv preprint arXiv:2504.07491. Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578. Khanh-Tung Tran, Dung Dao, Minh-Duong ...
[8]

What color is the car?

Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822. Shixin Yi and Lin Shang. 2025. Corgi: Verified chain-of-thought reasoning with visual grounding. arXiv e-prints, pages arXiv–2508. Kepu Zhang, Weijie Yu, Sunhao Dai, and Jun Xu. 2025. Citalaw: Enhancing llm with citati...
[9]

Inspect the image independently
[10]

Follow the evidence-grounding instruction
[11]

Answer the question with concise evidence-grounded reasoning
[12]

Provide one atomic visual claim that directly supports your answer
[13]

Provide the grounding boxes for the image region(s) that support this visual claim. Output JSON schema: { "reasoning": "brief evidence-grounded reasoning", "visual_claim": "one atomic visual finding explaining how the grounded regions support the answer", "grounding_boxes": [ {"label": "object or region name", "box": [x1, y1, x2, y2]} ], "answer": "short ...
[14]

Keep the reasoning concise and tied to visible evidence in the image
[15]

The visual_claim must be a single atomic visual finding that directly supports the answer
[16]

The grounding_boxes must localize the region(s) that support the visual_claim
[17]

Use tight boxes around the relevant visual evidence whenever possible
[18]

Do not ground irrelevant objects, background regions, or the whole image unless the evidence focus requires global scene evidence
[19]

If no specific local region is decisive, return grounding_boxes: []
[20]

Return raw JSON only.

Grounding boxes and coordinate normalization. The grounding_boxes field localizes the image region(s) that support the visual claim. Since different VLMs may emit boxes under different coordinate conventions, we normalize all predicted boxes into the original image pixel coordinate system before evidence diagnosis. This shared ...
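
The parsing and normalization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the source does not list which coordinate conventions the agents use, so the three convention names here (`pixel`, `unit` for values in [0, 1], `per_mille` for values in [0, 1000]) are hypothetical, as are the function names `to_pixels` and `parse_agent_output`. The JSON field names follow the output schema quoted in the prompt.

```python
import json


def to_pixels(box, width, height, convention):
    """Convert [x1, y1, x2, y2] into pixel coordinates of the original image.

    The convention names are assumptions made for this sketch.
    """
    x1, y1, x2, y2 = (float(v) for v in box)
    if convention == "pixel":
        scale_x = scale_y = 1.0
    elif convention == "unit":  # values in [0, 1]
        scale_x, scale_y = float(width), float(height)
    elif convention == "per_mille":  # values in [0, 1000]
        scale_x, scale_y = width / 1000.0, height / 1000.0
    else:
        raise ValueError(f"unknown convention: {convention}")

    def clamp(value, upper):
        return min(max(value, 0.0), float(upper))

    # Order the corners, then clamp them to the image bounds.
    xs = sorted((x1 * scale_x, x2 * scale_x))
    ys = sorted((y1 * scale_y, y2 * scale_y))
    return [clamp(xs[0], width), clamp(ys[0], height),
            clamp(xs[1], width), clamp(ys[1], height)]


def parse_agent_output(raw, width, height, convention):
    """Parse one agent's raw JSON reply and normalize its grounding boxes."""
    data = json.loads(raw)
    boxes = [
        {"label": b["label"],
         "box": to_pixels(b["box"], width, height, convention)}
        for b in data.get("grounding_boxes", [])
    ]
    return {
        "answer": data["answer"],
        "visual_claim": data.get("visual_claim", ""),
        "grounding_boxes": boxes,
    }
```

Normalizing every agent's boxes into one pixel space before comparison is what makes an overlap measure such as IoU meaningful across heterogeneous VLMs; an empty `grounding_boxes` list passes through unchanged.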
[21]

Re-read the original image independently
[22]

Use self and peer hypotheses as visual references, not as authority
[23]

Keep your previous answer if the image still supports it
[24]

Revise only if a newly verified visual observation better supports another answer
[25]

Keep the reasoning concise and grounded in visible evidence. Output JSON: { "answer": "short final answer", "reasoning": "brief image-grounded reasoning", "grounding_boxes": [ {"label": "object name", "box": [x1, y1, x2, y2]} ], "visual_claim": "one atomic visual finding that directly supports the answer" }

B.6 Evidence-Guided Arbitration

When no evidence...
[26]

Use explicit multi-step reasoning grounded in the image and question
[27]

Keep the reasoning focused and concrete rather than verbose
[28]

Return raw JSON only.

Self-Consistency (Wang et al., 2022). Self-Consistency samples multiple reasoning paths from a single model and aggregates their final answers by voting. For each question, we query the same backbone multiple times with the Zero-shot CoT prompt above. Each response contains its own reasoning path and final answer. We then discard t...
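
The voting step of the Self-Consistency baseline described above can be sketched as follows. This is a minimal illustration, assuming the final answers have already been extracted from each sampled response; the light string normalization in `normalize` is an assumption and not something the source specifies.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, trim, drop trailing periods, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent normalized answer among the samples.

    Ties go to the answer that appeared first, since Counter.most_common
    keeps first-seen order for equal counts.
    """
    counts = Counter(n for n in map(normalize, answers) if n)
    if not counts:
        raise ValueError("no valid answers to vote on")
    return counts.most_common(1)[0][0]
```

For example, `majority_vote(["Red", "red.", "blue"])` treats the first two samples as the same answer and returns `"red"`.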
[29]

Extract the final answer from each sampled response
[30]

Select the final answer by majority voting.

Self-Refine (Madaan et al., 2023). Self-Refine iteratively improves a model’s own answer using self-generated feedback. For each sample, the model first generates an initial answer with the Zero-shot CoT prompt. It then critiques its own reasoning and answer, and finally produces a refined response conditioned ...
[31]

Focus the critique on the most important possible error in the reasoning or answer
[32]

If the current answer is still supported by the image, keep it unchanged
[33]

Revise the answer only when the image provides evidence for the change
[34]

Do not introduce information that is not visible in the image
[35]

Return raw JSON only.

Multi-Agent Debate (Du et al., 2024; Liang et al., 2024). Multi-Agent Debate lets multiple agents exchange their answers and textual rationales over multiple rounds. In the first round, each agent independently answers the question using the Zero-shot CoT prompt. In later rounds, each agent observes the other agents’ previous answer...
[36]

Consider peers, but do not follow them blindly
[37]

Explain step by step how the peer evidence changes or confirms your view
[38]

Keep the reasoning concrete and tied to the image question
[39]

Return raw JSON only.

Debate Judge Prompt

You are given an image question and the full state of a multi-round debate among several vision-language agents. Question: {question} Debate states: {debate_text} Task:
[40]

Read the image yourself
[41]

Use the debate states only as auxiliary evidence
[42]

Identify all candidate answers that appeared in the debate states
[43]

Select the single best final answer from these candidate answers only. Output schema: { "reasoning": "brief image-grounded adjudication that explains why the selected candidate is best", "answer": "one candidate answer copied from the debate states" } Rules:
[44]

The image is the source of truth; do not blindly follow the debaters
[45]

You must choose one answer that already appears in the debate states
[46]

If multiple candidates are plausible, choose the one best supported by the image
[47]

Return raw JSON only.

ReConcile (Chen et al., 2024b). ReConcile is a confidence-driven multi-agent discussion framework. Each agent first provides an answer with a confidence score. Then, agents review grouped peer answers, justifications, and confidences before updating their predictions. After the final discussion round, we group semantically equival...
[48]

Base your answer on the image and question
[50]

Keep the reasoning focused and concrete
[51]

Return raw JSON only.

[Reconcile] You are in a round-table conference with other agents. Review grouped peer answers, justifications, and confidences, then update your answer and confidence. Question: {question} Previous response: {previous_text} Grouped peer views: {peer_json} Output JSON: { "reasoning": "brief evidence-grounded reasoning after review...
[52]

Review each answer group and compare the supporting justifications
[53]

Keep your answer if it remains best supported by the image
[54]

Change your answer only if another group provides more convincing visual evidence
[55]

Confidence must reflect your final belief after reviewing all groups
[56]

Return raw JSON only.

[Final confidence-aware aggregation] After the last discussion round, group semantically equivalent final answers. For each answer group y, compute its aggregation score as the sum of confidences from agents supporting y:

score(y) = sum(confidence_i for agents whose final answer is y)

Select the answer group with the highest score as...
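
The confidence-aware aggregation rule above, score(y) as the sum of confidences of the agents supporting answer group y, can be sketched as follows. This is a minimal illustration that assumes semantically equivalent answers have already been grouped under a shared label; the function name `aggregate_by_confidence` and the input layout are hypothetical.

```python
from collections import defaultdict


def aggregate_by_confidence(responses):
    """Select the answer group with the highest summed confidence.

    responses is a list of (answer_group, confidence) pairs, one per agent.
    Returns the winning group and the per-group scores.
    """
    scores = defaultdict(float)
    for group, confidence in responses:
        scores[group] += confidence  # score(y) = sum of supporting confidences
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Unlike plain majority voting, one highly confident agent can outweigh two hesitant ones: with `[("red", 0.9), ("blue", 0.4), ("blue", 0.4)]` the selected group is `"red"`.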
[57]

Select only tools that are useful for resolving the disagreement
[58]

Select "grounding" when agents disagree about where the relevant evidence is located
[59]

Select "object_detection" when agents disagree about the presence or identity of objects
[60]

Select "ocr" when the question depends on visible text, letters, numbers, labels, or symbols
[61]

Select "spatial_reasoning" when agents disagree about relative positions, directions, distances, or spatial configurations
[62]

Select "captioning" when global scene context may resolve the disagreement
[63]

Select "attribute_detection" when agents disagree about visual attributes such as color, shape, material, state, or markings
[64]

Select "reasoning" when the disagreement requires additional visual reasoning beyond direct perception
[65]

Return raw JSON only.

[Expert tool execution] Each selected tool is executed with its corresponding query. Tool implementations:
- grounding: GroundingDINO.
- object_detection: YOLOv11.
- spatial_reasoning: SpaceLLaVA.
- ocr: OCR-Qwen.
- captioning / attribute_detection / reasoning: InternVL-2.5 MPO.
Tool output format: { "tool_name": "tool_name", "query"...
[66]

Score each agent between 0 and 1
[67]

A high score means the agent's answer and reasoning are supported by the tool outputs
[68]

A low score means the agent's answer conflicts with or is unsupported by the tool outputs
[69]

Use the tool outputs as auxiliary evidence, not as the only criterion
[70]

Return raw JSON only. [Tool-assisted discussion] You are in a tool-assisted multi-agent discussion. Review the grouped agent solutions, tool outputs, and tool-agreement scores, then update your answer. Question: {question} Previous response: {previous_text} Grouped agent solutions: {grouped_json} Tool outputs: {tool_json} Agreement scores: {score_json} Ou...
[71]

Prefer answers supported by reliable tool outputs, while keeping the original image question central
[72]

Use agreement scores as auxiliary evidence, not as the only criterion
[73]

Keep your answer if it remains best supported by the image and tool evidence
[74]

Change your answer only when another candidate is better supported by visual evidence
[75]

Confidence must be a number between 0 and 1
[76]

Return raw JSON only. [Final aggregator] Choose the best final answer after reviewing post-discussion agent solutions, tool outputs, and tool-agreement scores. Question: {question} Post-discussion solutions: {discussion_json} Tool outputs: {tools_json} Agreement scores: {scores_json} Candidate answers: {candidate_answers} Output JSON: { "reasoning": "brie...
[77]

Select exactly one answer from Candidate answers
[78]

Do not invent a new answer or output an answer not proposed by any agent
[79]

Prefer answers supported by reliable tool outputs
[80]

Use vote counts, confidence scores, and tool-agreement scores together
[81]

Do not rely on tool scores alone if image-grounded reasoning contradicts them

Showing first 80 references.

[1] [1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen3-VL Technical Report

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631. Maciej Besta, Nils Blach, Ales Kubicek, Robert Gersten- berger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Pi- otr Nyczyk, and 1 others. 2024. Graph of thoughts: Solving elaborate problems with large language mod- els. InProceedings of the AAAI conferen...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model, 2025

Grounding answers for visual questions asked by visually impaired people. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 19098–19107. Justin Chen, Swarnadeep Saha, and Mohit Bansal. 2024b. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. InPro- ceedings of the 62nd Annual ...

work page arXiv 2025

[4] [4]

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

Global context or local detail? adaptive vi- sual grounding for hallucination mitigation.arXiv preprint arXiv:2604.24396. Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. 2025. V oting or consensus? decision-making in multi-agent debate. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11640–11...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

InProceedings of the Computer Vision and Pattern Recognition Conference, pages 191–201

Multimodal rationales for explainable visual question answering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 191–201. Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, and Jiebo Luo. 2018. Vqa-e: Explaining, elaborating, and en- hancing your answers for visual questions. InPro- ceedings of the European Conference on Compute...

work page arXiv 2018

[6] [6]

Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Improving automatic vqa evaluation using large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4171–4179. Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, Quy Duong Dang, Satwik Ramchandre, Son Lam Phung, Zhibin Liao, and 1 others. 2025. Localizing before answering: A benchmark for...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Kimi-VL Technical Report

Kimi-vl technical report.arXiv preprint arXiv:2504.07491. Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024. Eyes wide shut? exploring the visual shortcomings of multi- modal llms. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9568–9578. Khanh-Tung Tran, Dung Dao, Minh-Duong ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

What color is the car?

Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822. Shixin Yi and Lin Shang. 2025. Corgi: Verified chain- of-thought reasoning with visual grounding.arXiv e-prints, pages arXiv–2508. Kepu Zhang, Weijie Yu, Sunhao Dai, and Jun Xu. 2025. Citalaw: Enhancing llm with citati...


[9] [9]

Inspect the image independently

[10] [10]

Follow the evidence-grounding instruction

[11] [11]

Answer the question with concise evidence-grounded reasoning

[12] [12]

Provide one atomic visual claim that directly supports your answer

[13] [13]

reasoning

Provide the grounding boxes for the image region(s) that support this visual claim. Output JSON schema: { "reasoning": "brief evidence-grounded reasoning", "visual_claim": "one atomic visual finding explaining how the grounded regions support the answer", "grounding_boxes": [ {"label": "object or region name", "box": [x1, y1, x2, y2]} ], "answer": "short ...

[14] [14]

Keep the reasoning concise and tied to visible evidence in the image

[15] [15]

The visual_claim must be a single atomic visual finding that directly supports the answer

[16] [16]

The grounding_boxes must localize the region(s) that support the visual_claim

[17] [17]

Use tight boxes around the relevant visual evidence whenever possible

[18] [18]

Do not ground irrelevant objects, background regions, or the whole image unless the evidence focus requires global scene evidence

[19] [19]

If no specific local region is decisive, return grounding_boxes: []

[20] [20]

claim_aligned

Return raw JSON only.

Grounding boxes and coordinate normalization. The grounding_boxes field localizes the image region(s) that support the visual claim. Since different VLMs may emit boxes under different coordinate conventions, we normalize all predicted boxes into the original image pixel coordinate system before evidence diagnosis. This shared ...
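The normalization step described above can be illustrated with a minimal sketch. The text does not state which coordinate conventions the backbones emit, so the three handled here (`"unit"` for values in [0, 1], `"per_mille"` for values in [0, 1000], and `"pixel"` for absolute pixels), along with the function name `to_pixel_box`, are assumptions made for illustration only.

```python
from typing import List, Sequence

# Hypothetical convention names; the paper does not enumerate them.
_DENOMINATORS = {"unit": 1.0, "per_mille": 1000.0}


def to_pixel_box(box: Sequence[float], width: int, height: int,
                 convention: str) -> List[float]:
    """Map one [x1, y1, x2, y2] box into pixel coordinates of the original image."""
    x1, y1, x2, y2 = (float(v) for v in box)
    if convention == "pixel":
        xs, ys = [x1, x2], [y1, y2]
    elif convention in _DENOMINATORS:
        denom = _DENOMINATORS[convention]
        xs = [x1 * width / denom, x2 * width / denom]
        ys = [y1 * height / denom, y2 * height / denom]
    else:
        raise ValueError(f"unknown coordinate convention: {convention!r}")

    # Order the corners so that x1 <= x2 and y1 <= y2.
    xs.sort()
    ys.sort()

    # Clamp to the image bounds so downstream overlap checks stay well defined.
    def clamp(value: float, upper: int) -> float:
        return min(max(value, 0.0), float(upper))

    return [clamp(xs[0], width), clamp(ys[0], height),
            clamp(xs[1], width), clamp(ys[1], height)]
```

Once every agent's boxes are in the same pixel frame, regions predicted by different backbones can be compared directly.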

[21] [21]

Re-read the original image independently

[22] [22]

Use self and peer hypotheses as visual references, not as authority

[23] [23]

Keep your previous answer if the image still supports it

[24] [24]

Revise only if a newly verified visual observation better supports another answer

[25] [25]

answer":

Keep the reasoning concise and grounded in visible evidence. Output JSON: { "answer": "short final answer", "reasoning": "brief image-grounded reasoning", "grounding_boxes": [ {"label": "object name", "box": [x1, y1, x2, y2]} ], "visual_claim": "one atomic visual finding that directly supports the answer" } B.6 Evidence-Guided Arbitration When no evidence...


[26] [26]

Use explicit multi-step reasoning grounded in the image and question

[27] [27]

Keep the reasoning focused and concrete rather than verbose

[28] [28]

Self-Consistency(Wang et al., 2022)

Return raw JSON only.

Self-Consistency (Wang et al., 2022). Self-Consistency samples multiple reasoning paths from a single model and aggregates their final answers by voting. For each question, we query the same backbone multiple times with the Zero-shot CoT prompt above. Each response contains its own reasoning path and final answer. We then discard t...


[29] [29]

Extract the final answer from each sampled response

[30] [30]

critique

Select the final answer by majority voting.

Self-Refine (Madaan et al., 2023). Self-Refine iteratively improves a model’s own answer using self-generated feedback. For each sample, the model first generates an initial answer with the Zero-shot CoT prompt. It then critiques its own reasoning and answer, and finally produces a refined response conditioned ...
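The majority-voting step of the Self-Consistency baseline (extract each sampled answer, then vote) can be sketched as follows. The answer normalization, the handling of empty answers, and the first-seen tie-break are assumptions for illustration; the text does not specify them, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def normalize_answer(answer: str) -> str:
    """Assumed normalization: lowercase, drop trailing periods, collapse whitespace."""
    return " ".join(answer.strip().lower().rstrip(".").split())


def majority_vote(answers: Iterable[str]) -> Optional[str]:
    """Return the most frequent normalized answer, or None if none are usable."""
    # Assumption: empty or whitespace-only answers are skipped.
    cleaned = [normalize_answer(a) for a in answers if a and a.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    best = max(counts.values())
    # Assumed tie-break: the tied answer that was sampled first wins.
    for answer in cleaned:
        if counts[answer] == best:
            return answer
    return None
```

For example, `majority_vote(["Red", "red.", "blue"])` treats the first two samples as the same answer and returns `"red"`.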


[31] [31]

Focus the critique on the most important possible error in the reasoning or answer

[32] [32]

If the current answer is still supported by the image, keep it unchanged

[33] [33]

Revise the answer only when the image provides evidence for the change

[34] [34]

Do not introduce information that is not visible in the image

[35] [35]

reasoning

Return raw JSON only.

Multi-Agent Debate (Du et al., 2024; Liang et al., 2024). Multi-Agent Debate lets multiple agents exchange their answers and textual rationales over multiple rounds. In the first round, each agent independently answers the question using the Zero-shot CoT prompt. In later rounds, each agent observes the other agents’ previous answer...

2024

[36] [36]

Consider peers, but do not follow them blindly

[37] [37]

Explain step by step how the peer evidence changes or confirms your view

[38] [38]

Keep the reasoning concrete and tied to the image question

[39] [39]


Return raw JSON only. Debate Judge Prompt You are given an image question and the full state of a multi-round debate among several vision-language agents. Question: {question} Debate states: {debate_text} Task:

[40] [40]

Read the image yourself

[41] [41]

Use the debate states only as auxiliary evidence

[42] [42]

Identify all candidate answers that appeared in the debate states

[43] [43]

reasoning

Select the single best final answer from these candidate answers only. Output schema: { "reasoning": "brief image-grounded adjudication that explains why the selected candidate is best", "answer": "one candidate answer copied from the debate states" } Rules:

[44] [44]

The image is the source of truth; do not blindly follow the debaters

[45] [45]

You must choose one answer that already appears in the debate states

[46] [46]

If multiple candidates are plausible, choose the one best supported by the image

[47] [47]

reasoning

Return raw JSON only.

ReConcile (Chen et al., 2024b). ReConcile is a confidence-driven multi-agent discussion framework. Each agent first provides an answer with a confidence score. Then, agents review grouped peer answers, justifications, and confidences before updating their predictions. After the final discussion round, we group semantically equival...

[48] [48]

Base your answer on the image and question

[49] [50]

Keep the reasoning focused and concrete

[50] [51]

reasoning

Return raw JSON only.

[Reconcile] You are in a round-table conference with other agents. Review grouped peer answers, justifications, and confidences, then update your answer and confidence. Question: {question} Previous response: {previous_text} Grouped peer views: {peer_json} Output JSON: { "reasoning": "brief evidence-grounded reasoning after review...

[51] [52]

Review each answer group and compare the supporting justifications

[52] [53]

Keep your answer if it remains best supported by the image

[53] [54]

Change your answer only if another group provides more convincing visual evidence

[54] [55]

Confidence must reflect your final belief after reviewing all groups

[55] [56]

selected_tools

Return raw JSON only. [Final confidence-aware aggregation] After the last discussion round, group semantically equivalent final answers. For each answer group y, compute its aggregation score as the sum of confidences from agents supporting y: score(y) = sum(confidence_i for agents whose final answer is y) Select the answer group with the highest score as...
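The confidence-aware aggregation rule above, score(y) = sum of confidence_i over agents whose final answer is y, can be sketched in a few lines. Grouping "semantically equivalent" answers is approximated here by case and whitespace normalization, which is an assumption; the function name `aggregate_by_confidence` is likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_by_confidence(
    responses: Iterable[Tuple[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Sum confidences per answer group and return the top group with all scores.

    `responses` holds one (final_answer, confidence) pair per agent.
    """
    scores: Dict[str, float] = defaultdict(float)
    for answer, confidence in responses:
        # Assumed stand-in for semantic grouping: case/whitespace normalization.
        key = " ".join(answer.strip().lower().split())
        scores[key] += float(confidence)
    if not scores:
        raise ValueError("no agent responses to aggregate")
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

With responses `[("cat", 0.9), ("dog", 0.5), ("Dog", 0.5)]`, the "dog" group scores 1.0 and is selected over the single higher-confidence "cat" vote.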


[56] [57]

Select only tools that are useful for resolving the disagreement

[57] [58]

grounding

Select "grounding" when agents disagree about where the relevant evidence is located

[58] [59]

object_detection

Select "object_detection" when agents disagree about the presence or identity of objects

[59] [60]

Select "ocr" when the question depends on visible text, letters, numbers, labels, or symbols

[60] [61]

spatial_reasoning

Select "spatial_reasoning" when agents disagree about relative positions, directions, distances, or spatial configurations

[61] [62]

captioning

Select "captioning" when global scene context may resolve the disagreement

[62] [63]

attribute_detection

Select "attribute_detection" when agents disagree about visual attributes such as color, shape, material, state, or markings

[63] [64]

reasoning

Select "reasoning" when the disagreement requires additional visual reasoning beyond direct perception

[64] [65]

tool_name

Return raw JSON only. [Expert tool execution] Each selected tool is executed with its corresponding query. Tool implementations: - grounding: GroundingDINO. - object_detection: YOLOv11. - spatial_reasoning: SpaceLLaVA. - ocr: OCR-Qwen. - captioning / attribute_detection / reasoning: InternVL-2.5 MPO. Tool output format: { "tool_name": "tool_name", "query"...

[65] [66]

Score each agent between 0 and 1

[66] [67]

A high score means the agent's answer and reasoning are supported by the tool outputs

[67] [68]

A low score means the agent's answer conflicts with or is unsupported by the tool outputs

[68] [69]

Use the tool outputs as auxiliary evidence, not as the only criterion

[69] [70]

reasoning

Return raw JSON only. [Tool-assisted discussion] You are in a tool-assisted multi-agent discussion. Review the grouped agent solutions, tool outputs, and tool-agreement scores, then update your answer. Question: {question} Previous response: {previous_text} Grouped agent solutions: {grouped_json} Tool outputs: {tool_json} Agreement scores: {score_json} Ou...

[70] [71]

Prefer answers supported by reliable tool outputs, while keeping the original image question central

[71] [72]

Use agreement scores as auxiliary evidence, not as the only criterion

[72] [73]

Keep your answer if it remains best supported by the image and tool evidence

[73] [74]

Change your answer only when another candidate is better supported by visual evidence

[74] [75]

Confidence must be a number between 0 and 1

[75] [76]

reasoning

Return raw JSON only. [Final aggregator] Choose the best final answer after reviewing post-discussion agent solutions, tool outputs, and tool-agreement scores. Question: {question} Post-discussion solutions: {discussion_json} Tool outputs: {tools_json} Agreement scores: {scores_json} Candidate answers: {candidate_answers} Output JSON: { "reasoning": "brie...

[76] [77]

Select exactly one answer from Candidate answers

[77] [78]

Do not invent a new answer or output an answer not proposed by any agent

[78] [79]

Prefer answers supported by reliable tool outputs

[79] [80]

Use vote counts, confidence scores, and tool-agreement scores together

[80] [81]

Do not rely on tool scores alone if image-grounded reasoning contradicts them
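The aggregator rules above require the final answer to be exactly one of the candidate answers. A minimal sketch of a post-hoc check on the aggregator's raw JSON output is given below; the function name `check_final_answer`, the normalization used for matching, and the choice to raise an error rather than fall back are all assumptions for illustration, not part of the described pipeline.

```python
import json
from typing import Sequence


def _norm(text: str) -> str:
    """Assumed matching rule: ignore case and extra whitespace."""
    return " ".join(text.strip().lower().split())


def check_final_answer(raw_output: str, candidates: Sequence[str]) -> str:
    """Parse the aggregator's JSON and return the matching candidate answer.

    Raises ValueError if the output is malformed or names an answer
    that no agent proposed.
    """
    parsed = json.loads(raw_output)
    if not isinstance(parsed, dict):
        raise ValueError("aggregator output must be a JSON object")
    answer = parsed.get("answer")
    if not isinstance(answer, str):
        raise ValueError("aggregator output has no string 'answer' field")
    by_norm = {_norm(c): c for c in candidates}
    key = _norm(answer)
    if key not in by_norm:
        raise ValueError(f"answer {answer!r} is not among the candidate answers")
    # Return the candidate as originally written by the agents.
    return by_norm[key]
```

A check of this kind makes violations of the "do not invent a new answer" rule visible instead of silently accepting them.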