Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Andreas Vlachos; Caiqi Zhang; Nigel Collier; Tom Stafford; Xiaochen Zhu; Yizhou Chi

arxiv: 2601.19921 · v3 · pith:WK5HZ5QF · submitted 2026-01-09 · cs.CL · cs.AI

Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos

keywords: debate, confidence, updates, vanilla, agents, calibrated, confidence-modulated, correct

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
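To make the role of confidence concrete, the toy sketch below contrasts plain majority vote with a confidence-weighted vote over the same set of answers. It is not the paper's protocol: the function names, the example answers, and the confidence values are all assumptions chosen for illustration, and it only shows the general intuition that a well-calibrated minority agent can outweigh a low-confidence majority.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Return the answer whose supporters have the largest summed confidence.

    Hypothetical aggregation rule for illustration only; the paper's
    confidence-modulated update is not specified in the abstract.
    """
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Illustrative values (assumed): two unsure agents pick "A",
# one confident agent picks "B".
answers = ["A", "A", "B"]
confidences = [0.3, 0.3, 0.9]

print(majority_vote(answers))
print(confidence_weighted_vote(answers, confidences))
```

With these assumed values, majority vote picks the answer held by two agents, while the weighted vote favours the single high-confidence agent. Whether that helps depends entirely on the confidences being calibrated, which is the condition the abstract emphasises.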

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate
cs.MA · 2026-05 · unverdicted · novelty 6.0

SVR-MAD treats pre-debate signals as priors and debate results as evidence to build a sparser communication graph, cutting token use by up to 61% while preserving or raising accuracy over prior MAD methods.

The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
cs.CL · 2026-05 · unverdicted · novelty 6.0

Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...

Emergent Social Intelligence Risks in Generative Multi-Agent Systems
cs.MA · 2026-03 · unverdicted · novelty 5.0

Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.