The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment

Christopher C. Yang; Xiaoyang Wang

arXiv: 2606.08457 · v1 · pith:WGLE6QWC · submitted 2026-06-07 · 💻 cs.MA

Pith reviewed 2026-06-27 17:54 UTC · model grok-4.3

classification 💻 cs.MA

keywords: multi-agent debate, reasoning alignment, consistency illusion, medical question answering, LLM systems, CARA metrics, Grounded Debate Protocol, consensus reliability

The pith

Multi-agent debate in medical QA makes agents agree on answers while making their reasoning less similar.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In medical question answering, multi-agent LLM systems often treat answer consensus as evidence that the result is reliable. This paper shows that standard debate protocols can reduce detectable contradictions between agents while lowering the semantic similarity of the agents' underlying reasoning chains. The authors introduce CARA metrics to detect this gap between answer agreement and reasoning alignment. They call the result a consistency illusion: agents appear more aligned after debate but actually reason less consistently. A new Grounded Debate Protocol that requires agents to name specific medical facts and explicitly address each other's claims improves alignment without extra model calls.

Core claim

The paper establishes that answer-level consensus in multi-agent debate does not imply reasoning-level alignment. On MedQA-USMLE and MedThink-Bench, standard debate lowers detectable contradictions between agents yet decreases the semantic similarity of their reasoning chains, as measured by the CARA metrics; the resulting consistency illusion is presented as a distinct failure mode. The Grounded Debate Protocol targets this misalignment by requiring agents to commit to named medical facts and take explicit stances on other agents' claims, producing alignment improvements with Cohen's d of +1.43 to +1.99 across two datasets and two backbone models while leaving system architecture unchanged.
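For readers unfamiliar with the effect-size statistic quoted above, the following is a minimal sketch of Cohen's d in its pooled-standard-deviation form. The abstract does not say whether the authors used this independent-samples form or a paired variant, so the choice here is an assumption, and the scores are made-up numbers for illustration rather than data from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d with a pooled standard deviation (independent-samples form)."""
    n1, n2 = len(treatment), len(control)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical alignment scores, for illustration only.
gdp_scores = [0.70, 0.74, 0.78, 0.82, 0.86]
debate_scores = [0.60, 0.64, 0.68, 0.72, 0.76]
print(round(cohens_d(gdp_scores, debate_scores), 2))  # prints 1.58
```

By the usual rule of thumb, values above 0.8 are read as large effects, which is why the reported range of +1.43 to +1.99 is described as large.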

What carries the argument

CARA (Cross-Agent Reasoning Alignment) metrics that quantify semantic similarity of reasoning chains among agents who reach the same answer.
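The exact CARA formulas are not given in the text available here, so the following is only a sketch of one plausible member of such a family: the mean pairwise cosine similarity between reasoning-chain embeddings, restricted to agents that share the majority answer. The function names, the `(answer, embedding)` input shape, and the use of cosine similarity are all assumptions for illustration; the embeddings would come from whatever sentence-embedding model a given system uses.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reasoning_alignment(agents):
    """Mean pairwise cosine similarity among agents sharing the majority answer.

    `agents` is a list of (answer, embedding) pairs (hypothetical input shape).
    Returns None when fewer than two agents agree, since no pair exists.
    """
    majority, _ = Counter(answer for answer, _ in agents).most_common(1)[0]
    vectors = [emb for answer, emb in agents if answer == majority]
    if len(vectors) < 2:
        return None
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings": three agents answer B, one answers C.
agents = [
    ("B", [1.0, 0.0]),
    ("B", [0.0, 1.0]),
    ("B", [1.0, 1.0]),
    ("C", [1.0, 0.0]),
]
print(reasoning_alignment(agents))
```

In this toy case all three agreeing agents give the same answer, yet their average pairwise similarity is below 0.5, which is the kind of gap between answer agreement and reasoning alignment the metric is meant to expose.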

If this is right

Answer consensus alone cannot be treated as a reliability signal in safety-critical multi-agent systems.
Standard debate protocols can actively reduce reasoning consistency even as they increase answer agreement.
The Grounded Debate Protocol improves reasoning alignment without increasing the number of LLM calls or changing system architecture.
Cross-agent reasoning alignment should be audited alongside accuracy when deploying multi-agent systems in medical domains.
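An audit along these lines could compare two quantities before and after a debate round. The sketch below assumes a contradiction measure (called CR in the Figure 5 caption, read here as a contradiction rate) and a similarity measure (SIM) are already computed per round. The function name, the labels, and the strict less-than comparisons are illustrative assumptions, not the authors' decision rule.

```python
def classify_debate_round(cr_before, cr_after, sim_before, sim_after):
    """Label a debate round by how contradiction and similarity moved.

    Assumed convention: the consistency illusion is the case where
    contradictions fall AND reasoning similarity also falls.
    """
    if cr_after < cr_before and sim_after < sim_before:
        return "consistency illusion"
    if sim_after > sim_before:
        return "alignment improved"
    return "no clear pattern"


# Hypothetical values, for illustration only.
print(classify_debate_round(0.30, 0.20, 0.60, 0.50))  # prints consistency illusion
print(classify_debate_round(0.30, 0.28, 0.60, 0.70))  # prints alignment improved
```

A real audit would want a noise threshold or a significance test rather than raw inequalities, since small fluctuations in either measure would otherwise trigger a label.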

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same divergence between answer agreement and reasoning alignment could appear in non-medical domains where multi-agent debate is used for factual or technical questions.
Systems might benefit from hybrid protocols that combine debate with automated checks for reasoning similarity before accepting consensus.
The illusion suggests that evaluation benchmarks for multi-agent systems should include reasoning-chain comparison in addition to final-answer accuracy.

Load-bearing premise

The CARA metrics capture genuine differences in reasoning rather than merely differences in surface wording or phrasing.

What would settle it

If human experts rate pairs of reasoning chains as aligned or misaligned and the CARA scores fail to match those ratings on a held-out set of medical QA debates, the claim that debate produces a measurable consistency illusion would be undermined.
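One way to run such a check is a rank correlation between automated scores and expert ratings on the same reasoning-chain pairs. The sketch below implements Spearman's rho in plain Python. The choice of Spearman, the function names, and the sample values are assumptions for illustration; the source does not specify a validation statistic.

```python
import math


def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical metric scores and expert ratings for four chain pairs.
metric_scores = [0.10, 0.40, 0.35, 0.80]
expert_ratings = [1, 3, 2, 4]
print(spearman(metric_scores, expert_ratings))
```

A rho near zero or negative on a held-out set would be the failure case described above; a high rho would support reading the metric as tracking expert judgments of alignment.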

Figures

Figures reproduced from arXiv: 2606.08457 by Christopher C. Yang, Xiaoyang Wang.

**Figure 1.** The consistency illusion: three agents independently agree on the clinically correct answer (atropine for symptomatic bradycardia), yet their rationales invoke three mutually exclusive pharmacological targets. Empirical 2D visualization on D2 (MedThink-Bench) appears in Appendix N.

… agents independently converge on the same answer, that answer is reliable. We argue this trust assumption is incomplete: ans…

**Figure 3.** Expert scoring-point coverage on D2: stan…

**Figure 4.** Tercile monotonicity across two external val…

**Figure 5.** Empirical (CR, SIM) trajectory on D2 (N=499, Qwen 2.5 72B). M3 (red arrow) moves r0 → r1 with both CR ↓ and SIM ↓ (consistency illusion). GDP (green arrow) moves r0 → r1 with SIM ↑ while keeping CR informative.

…ing statements, which reduces CR. However, this removal is subtraction without alignment: agents delete contradictory steps without replacing them with shared reasoning anchors that other agents have also committ…

Original abstract

Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning. Applying CARA to a standard debate system on two medical QA benchmarks, MedQA-USMLE and MedThink-Bench, we identify the consistency illusion: a failure mode where debate reduces detectable contradictions between agents while simultaneously decreasing the semantic similarity of their reasoning chains; agents appear to agree more but reason less consistently. To improve this misalignment, we propose the Grounded Debate Protocol (GDP), a prompt-level intervention that requires agents to commit to named medical facts and take explicit stances on other agents' claims. GDP produces large, consistent alignment improvements, with Cohen's d ranging from +1.43 to +1.99, across two datasets and two backbone models, without adding LLM calls or modifying system architecture. Our results motivate cross-agent reasoning alignment as a quantity to audit alongside accuracy in safety-critical domains.
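To make "prompt-level intervention" concrete, here is a hypothetical sketch of how a grounded debate turn might be assembled. The wording of the instructions, the function name, and the AGREE/DISAGREE format are invented for illustration. The authors' actual GDP prompt is not reproduced in the text available here, and this should not be read as their protocol.

```python
def build_grounded_prompt(question, peer_claims):
    """Assemble one debate-turn prompt (hypothetical wording, not the paper's).

    The two requirements taken from the abstract are: commit to named
    medical facts, and take an explicit stance on each peer claim.
    """
    lines = ["Question: " + question, "", "Other agents' claims:"]
    for index, claim in enumerate(peer_claims, start=1):
        lines.append(f"  [{index}] {claim}")
    lines += [
        "",
        "Instructions:",
        "1. List the named medical facts your answer depends on.",
        "2. For each claim above, state AGREE or DISAGREE with a one-line reason.",
        "3. Give your final answer.",
    ]
    return "\n".join(lines)


# Illustrative inputs only.
prompt = build_grounded_prompt(
    "Which drug is first-line for symptomatic bradycardia?",
    ["Claim from agent 1", "Claim from agent 2"],
)
print(prompt)
```

Because the change lives entirely in the prompt text, a sketch like this adds no model calls per round, which matches the abstract's description of the intervention as leaving the system architecture unchanged.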

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper flags a real misalignment risk in multi-agent medical QA debate and offers lightweight metrics plus a fix, but CARA's validity as a reasoning measure is the open question.

The main thing to know is that standard debate setups can drive answer agreement while making agents' reasoning chains less semantically similar, and the authors give this a name plus a way to measure and reduce it.

CARA is new as a family of automated metrics focused on reasoning alignment rather than final answers, and GDP is a prompt protocol that forces explicit fact commitments and stance-taking. Both are tested on MedQA-USMLE and MedThink-Bench with two backbones, producing Cohen's d gains of 1.43 to 1.99 without extra LLM calls. That is concrete and directly useful for anyone running multi-agent systems in medicine.

The soft spot is the metric construction itself. The abstract does not show how CARA separates genuine reasoning overlap from surface text similarity, and there are no reported checks on whether the similarity measure was validated against human judgments or controlled for length and style. If that assumption does not hold, the illusion claim weakens. The effect sizes look large, but without those details it is difficult to judge how much is signal versus measurement choice.

This is for groups working on multi-agent reliability in safety-critical settings. A reader already thinking about consensus as a proxy will get immediate value from the distinction drawn here.

It should go to peer review. The problem is practical, the intervention is cheap to test, and the central observation is worth checking even if the metrics need tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that answer-level consensus in multi-agent LLM debate for medical QA does not imply reasoning-level alignment. It introduces the CARA family of metrics to quantify whether agreeing agents share similar reasoning chains, reports that standard debate produces a 'consistency illusion' (increased answer agreement but decreased reasoning similarity) on MedQA-USMLE and MedThink-Bench, and proposes the Grounded Debate Protocol (GDP) prompt intervention that yields large alignment gains (Cohen's d +1.43 to +1.99) across two datasets and two models without extra LLM calls.

Significance. If the CARA metrics are shown to be valid and independent measures of reasoning alignment, the work identifies a practically relevant failure mode for using consensus as a reliability signal in safety-critical domains and supplies a lightweight, architecture-preserving intervention; the reported effect sizes and cross-dataset consistency are strengths.

major comments (2)

[Abstract / §3] Abstract and §3 (CARA construction): the claim that CARA measures reasoning-level alignment (rather than surface textual similarity) is load-bearing for the consistency-illusion result, yet the abstract provides no details on metric construction, validation against human judgments, controls for length or lexical overlap, or statistical tests separating semantic from surface effects; without these the central interpretation cannot be assessed.
[§4] §4 (experimental results): the reported Cohen's d values for GDP are large, but the manuscript does not specify how reasoning chains were extracted, how contradictions were detected, or whether data exclusion rules or multiple-comparison corrections were applied; these details are required to evaluate whether the illusion and its mitigation are robust.

minor comments (2)

[§3] Notation for the CARA family should be defined once with explicit formulas rather than described narratively.
[§4] The two benchmarks are named but their construction, size, and any filtering steps are not summarized; a short table would improve reproducibility.
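The first major comment asks for a control that separates semantic similarity from surface overlap. One simple baseline a revision could report alongside an embedding-based score is token-level Jaccard overlap between two reasoning chains. The sketch below is an assumed illustration of such a control, with a deliberately crude tokenizer, and is not something the manuscript is known to include.

```python
import re


def jaccard_overlap(text_a, text_b):
    """Token-set Jaccard overlap: a surface-similarity baseline.

    Tokens are lowercase runs of letters and digits (a crude tokenizer,
    chosen here only to keep the sketch dependency-free).
    """
    tokens_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up rationales, for illustration only.
chain_a = "Atropine blocks muscarinic receptors"
chain_b = "Atropine blocks vagal tone"
print(round(jaccard_overlap(chain_a, chain_b), 3))  # prints 0.333
```

If the embedding-based alignment score moves in lockstep with a baseline like this across debate rounds, that would suggest the metric is tracking wording rather than reasoning; if the two diverge, the semantic interpretation becomes easier to defend.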

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight areas where additional clarity on CARA construction and experimental details will strengthen the manuscript. We address each point below and will incorporate the requested information in the revision.

Referee: [Abstract / §3] Abstract and §3 (CARA construction): the claim that CARA measures reasoning-level alignment (rather than surface textual similarity) is load-bearing for the consistency-illusion result, yet the abstract provides no details on metric construction, validation against human judgments, controls for length or lexical overlap, or statistical tests separating semantic from surface effects; without these the central interpretation cannot be assessed.

Authors: We agree that the abstract should include more detail on CARA to support the central claims. In the revised manuscript we will expand the abstract to summarize CARA construction (embedding-based semantic similarity on extracted reasoning chains) and note the inclusion of controls and validation. In §3 we will add explicit subsections describing validation against human judgments, length and lexical-overlap controls, and statistical tests that separate semantic from surface effects. These additions will make the reasoning-alignment interpretation directly assessable.

Revision: yes

Referee: [§4] §4 (experimental results): the reported Cohen's d values for GDP are large, but the manuscript does not specify how reasoning chains were extracted, how contradictions were detected, or whether data exclusion rules or multiple-comparison corrections were applied; these details are required to evaluate whether the illusion and its mitigation are robust.

Authors: We agree these methodological details are necessary for evaluating robustness. In the revised §4 we will specify the reasoning-chain extraction procedure, the exact method used to detect contradictions, the data-exclusion rules applied, and any multiple-comparison corrections. These clarifications will allow readers to assess the reliability of the reported effect sizes and the consistency illusion.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The abstract introduces CARA metrics and the consistency illusion as new quantities measured on debate outputs, with GDP as an independent prompt intervention. No equations, definitions, or self-citations are shown that would make the reported misalignment or alignment gains reduce by construction to fitted parameters, renamed inputs, or prior author results. The central claim (debate can increase answer consensus while decreasing reasoning similarity) is presented as an empirical observation on external benchmarks, not a tautology. Absent any load-bearing step that reduces to a self-reference, the paper's chain is treated as independent.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review based solely on abstract; full text not available so ledger is limited to elements explicitly invoked in the provided text.

axioms (1)

Domain assumption: answer-level consensus serves as a reliability signal in multi-agent LLM systems for medical QA.
The paper opens by noting that systems treat consensus as a trustworthiness indicator, which is the premise being challenged.

invented entities (2)

CARA (Cross-Agent Reasoning Alignment) metrics (no independent evidence)
purpose: Automated measurement of whether agents agreeing on an answer also align on reasoning
Newly introduced family of metrics; no independent evidence provided in abstract.

Grounded Debate Protocol (GDP) (no independent evidence)
purpose: Prompt-level intervention requiring commitment to named facts and explicit stances
Newly proposed protocol; no independent evidence provided in abstract.

