Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang · 2023 · cs.CL · arXiv 2305.19118

Canonical reference. 77% of citing Pith papers cite this work as background.

45 Pith papers citing it

Background 77% of classified citations

abstract

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.
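The debate loop described in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation (their code is at the repository linked above). The names `run_debate`, `Agent`, `Judge`, the `force` flag, and the toy stand-in functions are all hypothetical. A real system would call an LLM where the stubs return fixed strings.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a debater maps (question, transcript) to an argument;
# a judge returns a final answer, or None to let the debate continue.
Agent = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str], bool], Optional[str]]


def run_debate(
    question: str,
    affirmative: Agent,
    negative: Agent,
    judge: Judge,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run a two-sided debate; return (final answer, rounds used)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    transcript: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        # Each side sees the full transcript, so it can rebut the other side.
        transcript.append("affirmative: " + affirmative(question, transcript))
        transcript.append("negative: " + negative(question, transcript))
        # Adaptive break: the judge may end the debate early.
        verdict = judge(question, transcript, False)
        if verdict is not None:
            return verdict, round_idx
    # Assumption: after the last round the judge is forced to pick an answer.
    forced = judge(question, transcript, True)
    if forced is None:
        raise RuntimeError("judge returned no answer when forced")
    return forced, max_rounds


# Toy stand-ins for LLM calls (assumptions for illustration only).
def toy_affirmative(question: str, transcript: List[str]) -> str:
    return "I maintain the answer is 10."


def toy_negative(question: str, transcript: List[str]) -> str:
    return "I disagree; the answer is 12."


def toy_judge(question: str, transcript: List[str], force: bool) -> Optional[str]:
    # Decides once two full rounds are on record, or when forced.
    if force or len(transcript) >= 4:
        return "12"
    return None


if __name__ == "__main__":
    print(run_debate("toy question", toy_affirmative, toy_negative, toy_judge))
```

Passing the whole transcript to each debater is one simple way to let each side respond to the other. How strongly the agents are prompted to disagree, which the abstract calls the level of "tit for tat", would live in the prompts behind the agent functions and is not modeled here.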

citation-role summary

background: 11 · baseline: 1 · method: 1

citation-polarity summary

background: 10 · baseline: 1 · support: 1 · use method: 1

representative citing papers

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0 · 2 refs

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

Test-Time Hinting for Black-Box Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

cs.MA · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.

Learning to Interrupt in Language-based Multi-agent Communication

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.

What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

cs.CL · 2026-03-09 · unverdicted · novelty 7.0

Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact and FEVER.

Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation

cs.MA · 2026-04-29 · unverdicted · novelty 6.0

Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.

Preregistered Belief Revision Contracts

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage

cs.AI · 2026-04-04 · unverdicted · novelty 6.0

PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.

TRINITY: An Evolved LLM Coordinator

cs.LG · 2025-12-04 · unverdicted · novelty 6.0

A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

cs.AI · 2025-08-29 · unverdicted · novelty 6.0

PosterForest uses a Poster Tree intermediate representation and hierarchical multi-agent reasoning to generate coherent scientific posters without training, outperforming prior methods in evaluations.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

Mixture-of-Agents Enhances Large Language Model Capabilities

cs.CL · 2024-06-07 · unverdicted · novelty 6.0

A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.

A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration

cs.CL · 2023-10-03 · conditional · novelty 6.0

DyLAN automatically selects and dynamically organizes LLM agents for collaboration, outperforming fixed-agent baselines on code generation, reasoning, and decision tasks with up to 25% accuracy gains on some MMLU subjects.

Large Language Models Cannot Self-Correct Reasoning Yet

cs.CL · 2023-10-03 · unverdicted · novelty 6.0

LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

cs.CL · 2023-09-29 · conditional · novelty 6.0

ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

cs.CL · 2023-09-07 · conditional · novelty 6.0

DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.

Cognitive Architectures for Language Agents

cs.AI · 2023-09-05 · accept · novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.

ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

ExComm adds cross-agent conflict detection and soft belief correction plus trajectory diversification to agentic test-time scaling, yielding 5-6% gains over baselines on AIME and GAIA benchmarks.

Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

Adapts SPRT as a compute governor for multi-agent LLM debates using Beta-modeled consensus scores from an LLM judge, yielding 3.7x call reduction on GSM8K at -2pp accuracy versus fixed rounds.

Robust Multi-Agent LLMs under Byzantine Faults

cs.MA · 2026-05-09 · unverdicted · novelty 5.0

SAC is a decentralized iterative filter-and-refine protocol that achieves (F+1)-robustness in LLM multi-agent systems, suppressing Byzantine influence and improving performance on reasoning benchmarks where prior methods fail.

citing papers explorer

Showing 45 of 45 citing papers.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems cs.MA · 2024-10-09 · unverdicted · none · ref 34 · 2 links · internal anchor
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
Test-Time Hinting for Black-Box Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies cs.MA · 2026-05-12 · unverdicted · none · ref 14 · 2 links · internal anchor
Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement cs.CV · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning cs.AI · 2026-05-07 · unverdicted · none · ref 15 · internal anchor
Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
Learning to Interrupt in Language-based Multi-agent Communication cs.CL · 2026-04-07 · unverdicted · none · ref 22 · internal anchor
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.
What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network cs.CL · 2026-03-09 · unverdicted · none · ref 29 · internal anchor
Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems cs.MA · 2025-06-05 · accept · none · ref 88 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning cs.CL · 2026-05-03 · unverdicted · none · ref 3 · internal anchor
Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact and FEVER.
Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation cs.MA · 2026-04-29 · unverdicted · none · ref 15 · internal anchor
Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.
Preregistered Belief Revision Contracts cs.AI · 2026-04-16 · unverdicted · none · ref 27 · internal anchor
PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage cs.AI · 2026-04-04 · unverdicted · none · ref 27 · internal anchor
PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.
TRINITY: An Evolved LLM Coordinator cs.LG · 2025-12-04 · unverdicted · none · ref 12 · internal anchor
A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.
PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation cs.AI · 2025-08-29 · unverdicted · none · ref 9 · internal anchor
PosterForest uses a Poster Tree intermediate representation and hierarchical multi-agent reasoning to generate coherent scientific posters without training, outperforming prior methods in evaluations.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 115 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Mixture-of-Agents Enhances Large Language Model Capabilities cs.CL · 2024-06-07 · unverdicted · none · ref 15 · internal anchor
A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration cs.CL · 2023-10-03 · conditional · none · ref 10 · internal anchor
DyLAN automatically selects and dynamically organizes LLM agents for collaboration, outperforming fixed-agent baselines on code generation, reasoning, and decision tasks with up to 25% accuracy gains on some MMLU subjects.
Large Language Models Cannot Self-Correct Reasoning Yet cs.CL · 2023-10-03 · unverdicted · none · ref 11 · internal anchor
LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 21 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models cs.CL · 2023-09-07 · conditional · none · ref 52 · internal anchor
DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 44 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling cs.AI · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
ExComm adds cross-agent conflict detection and soft belief correction plus trajectory diversification to agentic test-time scaling, yielding 5-6% gains over baselines on AIME and GAIA benchmarks.
Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection cs.LG · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
Adapts SPRT as a compute governor for multi-agent LLM debates using Beta-modeled consensus scores from an LLM judge, yielding 3.7x call reduction on GSM8K at -2pp accuracy versus fixed rounds.
Robust Multi-Agent LLMs under Byzantine Faults cs.MA · 2026-05-09 · unverdicted · none · ref 28 · internal anchor
SAC is a decentralized iterative filter-and-refine protocol that achieves (F+1)-robustness in LLM multi-agent systems, suppressing Byzantine influence and improving performance on reasoning benchmarks where prior methods fail.
When Independent Sampling Outperforms Agentic Reasoning cs.LG · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.
12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation cs.AI · 2026-05-03 · unverdicted · none · ref 7 · internal anchor
Twelve LLM agents in a 12 Angry Men jury setup almost always end in hung juries due to anchoring, with Llama-4-Scout showing more vote changes than GPT-4o, suggesting RLHF alignment intensity limits deliberative flexibility.
Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems cs.MA · 2026-04-21 · unverdicted · none · ref 4 · internal anchor
MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research cs.HC · 2026-04-20 · unverdicted · none · ref 65 · internal anchor
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning cs.CL · 2026-04-11 · unverdicted · none · ref 47 · internal anchor
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs cs.CL · 2026-04-11 · unverdicted · none · ref 47 · internal anchor
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models cs.LG · 2026-03-31 · unverdicted · none · ref 7 · internal anchor
Language models display model-specific escalation thresholds in uncertain decisions that are not explained by scale or architecture, and supervised fine-tuning on explicit uncertainty reasoning produces robust, generalizable policies.
ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction cs.CL · 2025-11-03 · unverdicted · none · ref 29 · internal anchor
ZoFia is a zero-shot fake news detection framework that uses hierarchical entity salience retrieval followed by multi-LLM adversarial debate to improve robustness over single-model approaches.
How Far Are We from Generating Missing Modalities with Foundation Models? cs.MM · 2025-06-04 · unverdicted · none · ref 46 · internal anchor
Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.
Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate cs.AI · 2026-05-17 · unverdicted · none · ref 80 · internal anchor
TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.
Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures cs.AI · 2026-04-20 · unverdicted · none · ref 80 · internal anchor
A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR) cs.LG · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperformance over self-consistency on benchmarks and production data.
Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems cs.SE · 2025-08-21 · unverdicted · none · ref 33 · internal anchor
Proposes five foundational pillars and architectural patterns for building robust GenAI-native systems by combining AI with software engineering principles.
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model cs.CL · 2025-04-13 · unverdicted · none · ref 36 · internal anchor
QM-ToT applies Tree of Thoughts decomposition and evaluator layers to quantized LLMs, reporting accuracy gains from 34% to 50% on MedQAUSMLE for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b, plus an 86.27% improvement in data distillation using only 3.9% of the data.
A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 86 · internal anchor
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
The Rise and Potential of Large Language Model Based Agents: A Survey cs.AI · 2023-09-14 · accept · none · ref 113 · internal anchor
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance cs.IR · 2026-05-07 · unverdicted · none · ref 47 · internal anchor
A case-driven multi-agent system automates the full pipeline of bad-case detection, annotation, and resolution for e-commerce search relevance using Annotator, Optimizer, and User agents plus supporting components.
A Survey of Scaling in Large Language Model Reasoning cs.AI · 2025-04-02 · unverdicted · none · ref 108 · internal anchor
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
Large Language Model Agent: A Survey on Methodology, Applications and Challenges cs.CL · 2025-03-27 · accept · none · ref 80 · internal anchor
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
LLM Multi-Agent Systems: Challenges and Open Problems cs.MA · 2024-02-05 · unverdicted · none · ref 43 · internal anchor
The paper identifies inadequately addressed challenges in optimizing task allocation, fostering robust reasoning through debates, managing layered context, enhancing memory, and applying multi-agent systems to blockchain.
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration cs.AI · 2026-05-19 · unreviewed · ref 11 · internal anchor
