arxiv: 2603.20640 · v2 · submitted 2026-03-21 · 💻 cs.CL

Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention

Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le

Pith reviewed 2026-05-15 07:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords multi-agent debate · diversity-aware retention · large language models · message selection · noise reduction · iterative reasoning · agent communication

The pith

Retaining only the most mutually disagreeing agent responses at each round improves multi-agent debate quality and scales better than broadcasting everything.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Diversity-Aware Retention (DAR) as a way to run multi-agent debates among large language models without sending every response to every participant at every step. Instead of relying on uncertain confidence scores to drop messages, DAR keeps only the subset of replies that disagree most strongly with one another and with the current majority vote, then forwards those unchanged. Experiments across reasoning and question-answering tasks show the method raises final accuracy, with the largest gains appearing when the number of agents is increased and noise would otherwise accumulate fastest. The authors conclude that controlling what agents hear is at least as important as controlling what they say.

Core claim

Diversity-Aware Retention selects, at each debate round, the subset of agent responses that maximize disagreement with each other and with the majority vote, then broadcasts only those original messages via an index-based mechanism; this selective propagation reduces noise and redundancy relative to full broadcasting or uncertainty-threshold filtering, and the resulting debates produce higher final accuracy on diverse reasoning and QA benchmarks, with gains that grow as the agent count increases.

What carries the argument

Diversity-Aware Retention (DAR): an explicit index-based selector that keeps the original agent responses whose pairwise disagreements with one another and with the majority vote are largest.
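A minimal sketch of what such a selector could look like, assuming each response has already been reduced to a final-answer string and that disagreement is a simple 0/1 mismatch. The function names, the toy metric, the fixed subset size `k`, and the exhaustive search are all illustrative assumptions; the paper's own selector may work differently (the figure captions mention an LLM-based filter).

```python
from collections import Counter
from itertools import combinations


def majority_vote(answers):
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def disagreement(a, b):
    """Toy metric (an assumption): 1 if two final answers differ, else 0."""
    return int(a != b)


def select_diverse(answers, k):
    """Return the indices of the size-k subset with the highest disagreement score.

    Score = pairwise disagreements inside the subset
          + disagreements between each member and the majority vote.
    Exhaustive search, so only suitable for small agent counts.
    """
    vote = majority_vote(answers)

    def score(subset):
        pairwise = sum(disagreement(answers[i], answers[j])
                       for i, j in combinations(subset, 2))
        versus_vote = sum(disagreement(answers[i], vote) for i in subset)
        return pairwise + versus_vote

    k = min(k, len(answers))
    # max() returns the first best subset in lexicographic index order.
    return list(max(combinations(range(len(answers)), k), key=score))
```

With answers `["117", "117", "117", "124", "120"]` and `k = 2`, this sketch keeps indices 3 and 4, the two minority answers, and drops every majority-aligned response. Only indices are returned, so the original messages can be forwarded unchanged.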

If this is right

Noise accumulation is the dominant failure mode when agent count grows, so selective retention yields larger relative gains at scale.
Preserving unmodified original messages avoids the distortion introduced by rewriting or summarizing.
The approach outperforms uncertainty-based filtering because it does not depend on calibrated scores or threshold tuning.
Final answer quality depends on the composition of what each agent receives, not merely on the total volume of messages generated.
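The point about preserving unmodified messages can be illustrated with a small sketch of index-based retention. It assumes the selector returns raw text containing a Python-style list of agent ids (the wording used in the prompts reproduced in the figure captions) and that messages are stored in a dict keyed by agent id. The function names and the fall-back-to-everything policy are hypothetical, not taken from the paper.

```python
import ast


def parse_agent_ids(raw, valid_ids):
    """Parse a Python-style list of agent ids from a selector's raw text output.

    Unknown ids and duplicates are ignored. If nothing usable is found,
    all valid ids are returned (an assumed fallback policy).
    """
    valid_ids = list(valid_ids)
    canonical = {v: v for v in valid_ids}
    try:
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError, TypeError):
        return valid_ids
    if not isinstance(parsed, (list, tuple)):
        return valid_ids
    kept = []
    for item in parsed:
        # Only hashable scalar ids are looked up; anything else is skipped.
        if isinstance(item, (int, str)) and item in canonical:
            agent_id = canonical[item]
            if agent_id not in kept:
                kept.append(agent_id)
    return kept or valid_ids


def retain(messages, raw_selection):
    """Forward the original message objects for the chosen ids, untouched."""
    ids = parse_agent_ids(raw_selection, messages.keys())
    return {i: messages[i] for i in ids}
```

Because the selector only ever emits ids, it cannot paraphrase or summarize a message; whatever is forwarded is the same object the agent produced.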

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disagreement-maximizing filter could be applied to other multi-agent coordination tasks such as tool use or planning where redundant messages also waste context.
Because the selector is index-based and lightweight, it adds negligible compute compared with the cost of generating the messages themselves.
If the retained disagreements are later shown to correlate with specific error types, the method could be extended to retain messages that cover complementary error modes rather than pure diversity.

Load-bearing premise

Responses that disagree most with each other and the majority vote still contain the information required to reach the correct final answer and do not discard useful but less diverse contributions.

What would settle it

A controlled run in which the DAR-retained messages produce a wrong final answer while the full set of messages would have produced the correct answer, or in which accuracy falls rather than rises as the number of agents is increased.
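Both conditions can be checked mechanically once per-question outcomes are logged. The sketch below assumes a hypothetical record layout with the keys `n_agents`, `full_correct`, and `dar_correct`; none of these names come from the paper or its code release.

```python
def settle_checks(records):
    """Summarize the two falsifying conditions from logged outcomes.

    records: iterable of dicts with hypothetical keys
      n_agents     - number of agents in the run
      full_correct - True if full broadcasting reached the correct answer
      dar_correct  - True if the DAR run reached the correct answer

    Returns (regressions, accuracy_by_n, drops):
      regressions   - records where full broadcasting was right and DAR was wrong
      accuracy_by_n - DAR accuracy per agent count, in increasing agent count
      drops         - (smaller_n, larger_n) pairs where accuracy fell
    """
    records = list(records)
    regressions = [r for r in records
                   if r["full_correct"] and not r["dar_correct"]]

    outcomes = {}
    for r in records:
        outcomes.setdefault(r["n_agents"], []).append(bool(r["dar_correct"]))
    accuracy_by_n = {n: sum(v) / len(v) for n, v in sorted(outcomes.items())}

    counts = list(accuracy_by_n)
    drops = [(a, b) for a, b in zip(counts, counts[1:])
             if accuracy_by_n[b] < accuracy_by_n[a]]
    return regressions, accuracy_by_n, drops
```

A non-empty `regressions` list corresponds to the first condition and a non-empty `drops` list to the second; whether either is large enough to matter would still need a significance test.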

Figures

Figures reproduced from arXiv: 2603.20640 by Anh Nguyen, Dung Nguyen, Hung Le, Manh Nguyen, Svetha Venkatesh.

**Figure 2.** Average performance over seven benchmarks for different numbers of agents.

**Figure 3.** DAR recovers minority-correct answers while standard MAD fails. Example from Arithmetics.

**Figure 4.** Average results on Arithmetics and Form.Log. over debate rounds.

**Figure 5.** DAR retaining prompt.

**Figure 6.** DAR Uncertain Prompt. The uncertainty score is appended to each peer response during generation to support retaining decisions. The value shown is an illustrative example. Prompt text: "Uncertainty score (Average Negative Log Likelihood) for this response: 0.123"

**Figure 7.** DAR Vote Prompt. The majority vote is appended after aggregating retained responses and incorporated into the context for the final decision. The value shown is an illustrative example. Prompt text: "Majority vote from last round: 123"

**Figure 8.** Prompt for DAR w/o Index-based filter.

**Figure 9.** Prompt for retaining Certain Answers. Prompt text: "Your ONLY task is to choose a subset of agent ids. Return ONLY a Python-style list of agent ids. Valid agent IDs: {peers} Responses from agents: {message with ids} Criteria: choose the most certain agents"

**Figure 10.** Prompt for retaining Similar Answers. Prompt text: "Your ONLY task is to choose a subset of agent ids. Return ONLY a Python-style list of agent ids. Valid agent IDs: {peers} Responses from agents: {message with ids} Criteria: choose agents whose opinions are most similar agents"

**Figure 11.** Diversity of retained responses across retention strategies on Qwen2.5-3B. Similar result for [...] Panels: Arithmetics and Form.Log. Strategies compared: Society Of Mind, MAD-M2, Certain Answers, Similar Answers, DAR w/o LLM-based filter, DAR (Ours).

**Figure 12.** Diversity of retained responses across retention strategies on Qwen2.5-1.5B.

**Figure 13.** Complete qualitative responses on Qwen2.5-1.5B (Majority Vote).

**Figure 14.** Complete qualitative responses on Qwen2.5-1.5B (Society Of Mind).

**Figure 15.** Complete qualitative responses on Qwen2.5-1.5B (DAR).

Original abstract

Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems. Code is publicly available at https://github.com/DA2I2-SLM/DAR.
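To make the uncertainty-filtering baseline concrete, here is a minimal sketch that scores each response by its average negative log-likelihood (the score named in the prompt figures) and keeps responses at or below a cutoff. The function names, the input layout (one list of token log-probabilities per response), and the thresholding rule are assumptions for illustration; the baselines compared in the paper may filter differently.

```python
def average_nll(token_logprobs):
    """Average negative log-likelihood of one response; lower means more confident."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def filter_by_uncertainty(responses, threshold):
    """Return indices of responses whose average NLL is at or below the threshold.

    responses: list of per-token log-probability lists, one per agent response.
    """
    return [i for i, logprobs in enumerate(responses)
            if average_nll(logprobs) <= threshold]
```

With two responses whose average NLLs are 0.2 and 0.225, a threshold of 0.21 keeps one response and a threshold of 0.23 keeps both, which is the kind of threshold sensitivity the abstract refers to.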

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DAR gives a simple index-based rule to keep only the most disagreeing responses in multi-agent LLM debates, which seems to help at scale but risks dropping correct answers that happen to align with the majority.

The core idea here is a lightweight filter that, at each round, picks the subset of agent messages maximizing mutual disagreement plus disagreement with the current majority vote, then broadcasts only those. It replaces uncertainty-based filtering and leaves the original text unchanged. This is presented as new in its explicit disagreement-maximizing selection rather than confidence thresholds. The abstract reports consistent gains on reasoning and QA benchmarks, with the advantage growing as agent count increases, which fits the stated problem of noise accumulation from full broadcasting. Code is released, so the selection rule can be inspected directly. That is the main positive: a clean, reproducible tweak aimed at a real scaling pain point.

The soft spot is the untested assumption that the diversity objective will retain the reasoning chains that lead to the ground-truth answer. Nothing enforces that a correct but low-disagreement response stays in; if several agents converge on the right answer, the rule could drop it to satisfy the disagreement metric. The stress-test concern about an anti-correlation between diversity and correctness therefore lands on the method itself. The reported improvements are plausible but rest on abstract-level claims without visible ablations, statistical tests, or direct comparisons showing the diversity rule outperforms generic message reduction.

This is for groups already running multi-agent debate setups and looking for practical noise controls. A reader working on agent ensembles would get a usable idea to try. It deserves peer review because the problem is timely and the mechanism is straightforward enough to evaluate properly.
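The soft spot described in the letter can be measured directly by comparing how often correct and incorrect responses survive the filter. The sketch below assumes a hypothetical log layout, one `(answers, retained_indices, gold)` tuple per round, and exact string match against the gold answer; neither assumption comes from the paper.

```python
def retention_rates(rounds):
    """Retention rate of correct vs. incorrect responses across rounds.

    rounds: list of (answers, retained_indices, gold) tuples (hypothetical layout).
    Returns (rate_for_correct, rate_for_incorrect); a rate is None when that
    group never occurs.
    """
    kept = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for answers, retained, gold in rounds:
        retained = set(retained)
        for index, answer in enumerate(answers):
            is_correct = answer == gold
            total[is_correct] += 1
            if index in retained:
                kept[is_correct] += 1

    def rate(group):
        return kept[group] / total[group] if total[group] else None

    return rate(True), rate(False)
```

A rate for correct responses that sits well below the rate for incorrect ones would be evidence for the anti-correlation the letter worries about.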

Referee Report

2 major / 2 minor

Summary. The paper proposes Diversity-Aware Retention (DAR), a lightweight framework for multi-agent LLM debate that, at each round, selects and broadcasts only the subset of agent responses maximizing mutual disagreement plus disagreement with the current majority vote. This is claimed to reduce noise and redundancy compared to full broadcasting or uncertainty-based filtering. Experiments on reasoning and QA benchmarks are reported to show consistent performance gains that become more pronounced as the number of agents increases.

Significance. If the central empirical claim holds under rigorous validation, DAR could offer a practical, parameter-light way to scale multi-agent debate systems by mitigating noise accumulation without modifying original messages. Public code release is a positive factor for reproducibility.

major comments (2)

[§3] §3 (DAR mechanism): the selection rule retains responses that maximize disagreement with each other and the majority vote, yet no analysis or constraint ensures that reasoning chains leading to the ground-truth answer are preserved; a correct but low-disagreement response can be dropped, creating an untested risk that diversity and correctness are anti-correlated at scale.
[§4] §4 (Experiments): results claim consistent improvements and scaling benefits, but the manuscript provides no details on the number of independent runs, statistical significance tests, error bars, or ablations (e.g., DAR vs. random subset retention or vs. uncertainty thresholding), so it is not possible to attribute gains specifically to the diversity objective rather than generic noise reduction.

minor comments (2)

[Abstract / §4] The abstract and §4 refer to 'diverse reasoning and question answering benchmarks' without naming them or providing dataset statistics; this should be stated explicitly in the experimental setup.
[§3] The index-based retention procedure is described at a high level; adding pseudocode or a precise algorithmic description of how the maximal-disagreement subset is computed would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and outlining the revisions we plan to make.

Referee: [§3] §3 (DAR mechanism): the selection rule retains responses that maximize disagreement with each other and the majority vote, yet no analysis or constraint ensures that reasoning chains leading to the ground-truth answer are preserved; a correct but low-disagreement response can be dropped, creating an untested risk that diversity and correctness are anti-correlated at scale.

Authors: We agree that DAR has no explicit safeguard for reasoning chains that lead to the ground truth, so a correct response that aligns closely with the majority could be filtered out. This is a valid concern about a potential anti-correlation between diversity and correctness. However, our experiments show that DAR improves performance on reasoning tasks, particularly at larger scales, which suggests that the benefits of reducing redundancy outweigh this risk in practice. To strengthen the paper, we will add a discussion of this limitation to Section 3 and include new experiments that track the retention rate of correct vs. incorrect responses across rounds.

Revision: yes

Referee: [§4] §4 (Experiments): results claim consistent improvements and scaling benefits, but the manuscript provides no details on the number of independent runs, statistical significance tests, error bars, or ablations (e.g., DAR vs. random subset retention or vs. uncertainty thresholding), so it is not possible to attribute gains specifically to the diversity objective rather than generic noise reduction.

Authors: We acknowledge that the experimental section lacks sufficient details on reproducibility and controls. In the revised manuscript, we will report results averaged over 5 independent runs with standard error bars, include statistical significance tests (e.g., paired t-tests against baselines), and add ablation studies comparing DAR to random retention and uncertainty-based filtering. This will allow us to better attribute the performance gains to the diversity-aware selection.

Revision: yes
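The reporting promised in this response (means with standard errors over repeated runs, plus a paired t-test against a baseline) can be sketched with the standard library alone. The function names are illustrative, and the inputs are assumed to be per-run accuracies where run i of the method and run i of the baseline share a seed or data split.

```python
import math
import statistics


def mean_and_stderr(values):
    """Sample mean and standard error of the mean (needs at least two values)."""
    if len(values) < 2:
        raise ValueError("need at least two runs")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


def paired_t(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for matched runs."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("runs must be matched one-to-one")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff, stderr = mean_and_stderr(diffs)
    if stderr == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff / stderr, len(diffs) - 1
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call. With only 5 runs the test has 4 degrees of freedom, so small differences will rarely reach significance.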

Circularity Check

0 steps flagged

Empirical heuristic with no derivational circularity

The paper proposes Diversity-Aware Retention (DAR) as a direct heuristic: at each round, retain the subset of responses maximizing mutual disagreement plus disagreement with the current majority vote. This selection rule is defined explicitly from the disagreement metrics and majority vote without any fitted parameters, self-referential definitions, or derivations that reduce to the inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are used to justify the core mechanism. Performance gains are supported by experiments on reasoning benchmarks, so the evidence for the method is external rather than built into its definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on standard notions of disagreement and majority vote with no new free parameters, axioms, or invented entities described in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1016 out tokens · 41875 ms · 2026-05-15T07:36:56.326028+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting... F: (G_{r-1}, v_{r-1}) ↦ I_ret_r ⊆ I_r"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: "Proposition 3.1 (Variance and Effective Sample Size under Correlation)... Neff = N / (1 + (N-1)ρ)"
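The effective-sample-size expression quoted above is the standard result for N equally correlated estimates, and a short sketch shows how quickly it saturates. Reading N as the number of agents and ρ as the pairwise correlation between their responses is an assumption here, since the passage does not define the symbols; the function name is illustrative.

```python
def effective_sample_size(n, rho):
    """Neff = N / (1 + (N - 1) * rho) for N equally correlated estimates."""
    if n < 1:
        raise ValueError("n must be at least 1")
    denominator = 1 + (n - 1) * rho
    if denominator <= 0:
        raise ValueError("rho is too negative for this n")
    return n / denominator
```

At ρ = 0 the formula returns N, at ρ = 1 it returns 1, and for any fixed ρ > 0 it approaches 1/ρ as N grows, so adding strongly correlated agents contributes little independent evidence.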

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
