arXiv: 2305.14325 · v1 · submitted 2023-05-23 · 💻 cs.CL · cs.AI · cs.CV · cs.LG

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du , Shuang Li , Antonio Torralba , Joshua B. Tenenbaum , Igor Mordatch

Pith reviewed 2026-05-12 02:56 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG

keywords multiagent debate · language models · reasoning · factuality · hallucinations · prompting · multi-round discussion

The pith

Multiple language model instances improve their answers by debating proposals and reasoning over multiple rounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce a method in which several instances of a language model each propose an answer and then debate one another's reasoning over multiple rounds to arrive at a common final response. They report stronger results on mathematical and strategic reasoning tasks, along with fewer hallucinated facts and fallacious answers. The technique matters because it applies to existing black-box models using only text prompts, with no change to the underlying model. It points to a path for making AI-generated information more reliable through interaction rather than isolated generation. The same procedure and prompts are used for every task the authors investigate.

Core claim

The central discovery is that a multiagent debate setup, in which distinct language model copies propose individual responses and then engage in iterative exchanges of arguments and critiques, produces a consensus answer that outperforms standard single-model outputs in mathematical and strategic reasoning tasks while also increasing the factual correctness of the content.

What carries the argument

The multi-round debate mechanism among multiple LLM instances that allows proposal, critique, and convergence on a final answer.
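
To make the mechanism concrete, here is a minimal Python sketch of a propose-then-revise loop. It is an illustration under stated assumptions, not the authors' implementation: the `Agent` interface, the `debate` function, the early stop on full consensus, and the toy agent are all hypothetical. In a real setup each agent would be a language model call with the peers' answers pasted into its prompt.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent interface: (question, peer_answers) -> answer.
# A real agent would paste peer_answers into an LLM prompt; this is a stand-in.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run up to `rounds` debate rounds and return the most common final answer."""
    # Round 0: every agent answers independently (no peer answers yet).
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers and may revise its own.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
        if len(set(answers)) == 1:  # full consensus, stop early (an assumption)
            break
    return Counter(answers).most_common(1)[0][0]


def make_toy_agent(initial: str) -> Agent:
    """Toy stand-in for an LLM: adopts an answer given by a strict majority of peers."""
    def agent(question: str, peer_answers: List[str]) -> str:
        if peer_answers:
            top, count = Counter(peer_answers).most_common(1)[0]
            if count > len(peer_answers) / 2:
                return top
        return initial
    return agent


agents = [make_toy_agent("12"), make_toy_agent("12"), make_toy_agent("15")]
result = debate("What is 7 + 5?", agents)
```

With these toy agents the lone dissenter adopts the majority answer after one round. The same loop also returns a unanimous wrong answer unchanged, which is the failure mode the load-bearing premise is about.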

If this is right

Improved performance on mathematical reasoning tasks.
Better results in strategic reasoning scenarios.
Reduced incidence of incorrect factual statements and hallucinations.
Usable on existing black-box language models through prompting alone.
The same procedure and prompts work across the tasks investigated, without customization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The debate format might help in domains requiring creative problem solving by simulating diverse perspectives.
It suggests that error correction can emerge from interaction even if individual models share biases.
Further tests could examine whether the benefits persist when models are from different families or sizes.
This could inform designs for AI systems that incorporate internal deliberation steps.

Load-bearing premise

Debate among the models converges on the truth instead of creating agreement around a common mistake.

What would settle it

Running the debate process on a set of questions where all models start with the same wrong answer and checking if they still output that wrong answer at the end.
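
One way to score such a test is sketched below in Python. The `Run` record layout and the function name `shared_error_persistence` are hypothetical, chosen only for illustration; the sketch assumes each debate run has been logged as its initial answers, its final answer, and the ground truth.

```python
from typing import List, Sequence, Tuple

# Hypothetical log record for one debate run:
# (initial answers of all agents, final debate answer, ground-truth answer).
Run = Tuple[Sequence[str], str, str]


def shared_error_persistence(runs: List[Run]) -> float:
    """Among runs where every agent started with the same wrong answer,
    return the fraction whose final answer is still that wrong answer."""
    eligible = [
        (init, final)
        for init, final, truth in runs
        if len(set(init)) == 1 and init[0] != truth
    ]
    if not eligible:
        raise ValueError("no runs with a unanimous wrong start")
    persisted = sum(1 for init, final in eligible if final == init[0])
    return persisted / len(eligible)


runs = [
    (["15", "15", "15"], "15", "12"),  # unanimous wrong start, error persists
    (["15", "15", "15"], "12", "12"),  # unanimous wrong start, error corrected
    (["12", "15", "12"], "12", "12"),  # mixed start, not eligible
]
rate = shared_error_persistence(runs)
```

A rate near 1.0 would suggest that debate mostly ratifies a shared mistake; a rate well below 1.0 would suggest that the interaction can recover from it.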

Original abstract

Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a multi-agent debate framework in which multiple instances of an LLM generate initial answers and reasoning, then iteratively critique and refine each other's outputs over several rounds before converging on a final consensus answer. The method is evaluated on mathematical reasoning (e.g., GSM8K-style problems), strategic reasoning tasks, and factuality benchmarks, with the central claim that this 'society of minds' interaction yields substantial gains in accuracy and reduced hallucinations relative to standard single-prompt or few-shot baselines, using identical procedures across black-box models.

Significance. If the reported gains prove robust, the work offers a practical, training-free prompting technique that leverages inter-agent interaction to improve reasoning and factuality beyond what independent sampling provides. It extends prior ideas such as self-consistency and verification by introducing explicit debate, and its applicability to existing models without internal access or parameter changes makes it potentially impactful for real-world LLM deployment.

major comments (3)

[§4 and Table 2] §4 (Experimental Setup) and Table 2: the reported accuracy improvements on reasoning tasks are not accompanied by direct comparisons to strong baselines such as self-consistency sampling or majority vote over an equivalent number of independent generations; without these, it is impossible to determine whether the iterative debate supplies corrective signal beyond increased sampling.
[§5.2] §5.2 (Factuality Experiments): the evaluation of hallucination reduction relies on automatic metrics and human judgments whose inter-annotator agreement and statistical significance are not reported; this weakens the claim that debate specifically reduces fallacious answers rather than simply producing more plausible consensus text.
[§3.1] §3.1 (Debate Protocol): the description does not specify the exact prompt templates used for critique rounds or the tie-breaking rule when agents fail to reach consensus after the final round; these details are load-bearing for reproducibility and for isolating whether gains arise from the interaction itself.
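
The compute-matched comparison requested in the first major comment can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: the call-budget formula (one initial answer per agent plus one revision per agent per round) is an assumption about how debate cost would be counted, and `sample` is a hypothetical function that returns one independent model generation.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(samples: List[str]) -> str:
    """Return the most frequent answer among independent samples."""
    return Counter(samples).most_common(1)[0][0]


def debate_call_budget(n_agents: int, n_rounds: int) -> int:
    """Assumed cost model: one initial call per agent, plus one call
    per agent for each debate round."""
    return n_agents * (1 + n_rounds)


def matched_baseline(question: str, sample: Callable[[str], str],
                     n_agents: int, n_rounds: int) -> str:
    """Majority vote over as many independent samples as the debate would use calls."""
    budget = debate_call_budget(n_agents, n_rounds)
    return majority_vote([sample(question) for _ in range(budget)])


# Deterministic stand-in for a stochastic model, for illustration only.
canned = iter(["a", "b", "a", "a", "c", "b", "a", "a", "b"])
baseline_answer = matched_baseline("q", lambda q: next(canned), n_agents=3, n_rounds=2)
```

If debate accuracy exceeds this baseline at the same budget, the gain cannot be attributed to extra sampling alone.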

minor comments (2)

[Abstract and §1] The abstract and introduction use the phrase 'significantly enhances' without defining the threshold or providing effect sizes; this should be qualified with reference to the specific tables.
[Figure 1] Figure 1 (debate diagram) would benefit from an example trace showing an actual correction that occurs across rounds rather than a schematic only.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and rigor of the manuscript. We address each major comment below and have revised the paper accordingly to strengthen the presentation of our results and methods.

Referee: [§4 and Table 2] §4 (Experimental Setup) and Table 2: the reported accuracy improvements on reasoning tasks are not accompanied by direct comparisons to strong baselines such as self-consistency sampling or majority vote over an equivalent number of independent generations; without these, it is impossible to determine whether the iterative debate supplies corrective signal beyond increased sampling.

Authors: We agree that direct comparisons to self-consistency (Wang et al. 2023) and majority voting over an equivalent number of independent generations are necessary to isolate the benefit of iterative critique. In the revised manuscript we have added these baselines to Table 2 and §4, using the same total number of model calls as our debate setup (e.g., 3 or 5 generations). The new results show that multi-agent debate still yields statistically significant gains over both self-consistency and majority vote on GSM8K and strategic reasoning tasks, indicating that the corrective signal arises from the interaction rather than sampling alone. We have also clarified the experimental controls in §4. revision: yes
Referee: [§5.2] §5.2 (Factuality Experiments): the evaluation of hallucination reduction relies on automatic metrics and human judgments whose inter-annotator agreement and statistical significance are not reported; this weakens the claim that debate specifically reduces fallacious answers rather than simply producing more plausible consensus text.

Authors: We acknowledge that reporting inter-annotator agreement and statistical significance is essential. In the revision we have added Cohen’s kappa scores for the human factuality annotations (reported in the new Table 5) and performed McNemar’s tests to establish statistical significance of the hallucination reduction. We also clarify that the automatic metrics are drawn from TruthfulQA and that the debate protocol explicitly prompts agents to critique factual errors, not merely to produce fluent text. These additions are now in §5.2. revision: yes
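
For readers unfamiliar with the two statistics named in this response, the following Python sketch computes them from scratch. It is illustrative only: the function names and toy inputs are hypothetical, and the McNemar statistic shown is the uncorrected chi-square form. With small discordant counts, an exact binomial test is usually preferred.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def mcnemar_statistic(baseline_ok: Sequence[bool], debate_ok: Sequence[bool]) -> float:
    """Uncorrected McNemar chi-square, (b - c)^2 / (b + c), on paired outcomes."""
    b = sum(x and not y for x, y in zip(baseline_ok, debate_ok))  # only baseline correct
    c = sum(y and not x for x, y in zip(baseline_ok, debate_ok))  # only debate correct
    return 0.0 if b + c == 0 else (b - c) ** 2 / (b + c)


kappa = cohens_kappa(["T", "T", "F", "F"], ["T", "F", "F", "F"])
chi2 = mcnemar_statistic([True, False, False, False, True],
                         [True, True, True, True, False])
```

McNemar's test uses only the items on which the two systems disagree, which is why it suits a paired comparison of debate against a baseline on the same questions.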
Referee: [§3.1] §3.1 (Debate Protocol): the description does not specify the exact prompt templates used for critique rounds or the tie-breaking rule when agents fail to reach consensus after the final round; these details are load-bearing for reproducibility and for isolating whether gains arise from the interaction itself.

Authors: We thank the referee for highlighting this reproducibility gap. The exact prompt templates for the critique rounds have been added to Appendix A. For the tie-breaking rule, when agents do not converge after the maximum number of rounds we fall back to majority vote over the final responses, with a uniform random choice in the event of a tie; this procedure is now stated explicitly in §3.1. These changes allow readers to replicate the interaction dynamics precisely. revision: yes
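
The fallback rule described in this response (majority vote over the final responses, with a uniform random choice on a tie) could look like the Python sketch below. The function name `final_answer` and the optional `rng` parameter are hypothetical; the seeded generator is there only so that runs can be reproduced.

```python
import random
from collections import Counter
from typing import List, Optional


def final_answer(answers: List[str], rng: Optional[random.Random] = None) -> str:
    """Majority vote over final-round answers; ties broken uniformly at random."""
    if not answers:
        raise ValueError("need at least one answer")
    rng = rng or random.Random()
    counts = Counter(answers)
    top = max(counts.values())
    # All answers that share the highest count; sorted for reproducibility.
    tied = sorted(a for a, c in counts.items() if c == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)


clear_winner = final_answer(["12", "12", "15"])
tie_pick = final_answer(["12", "15"], rng=random.Random(0))
```

Sorting the tied answers before sampling makes the outcome depend only on the seed, not on the order in which agents responded.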

Circularity Check

0 steps flagged

No significant circularity: empirical prompting method with independent experimental support

The paper introduces a multi-agent debate prompting procedure for LLMs and evaluates it empirically on math, strategy, and factuality tasks. No equations, derivations, fitted parameters, or ansatzes are present. Claims rest on reported performance gains from black-box model experiments rather than any internal reduction to inputs or self-citation chains. The approach is self-contained against external benchmarks, with no load-bearing self-definitional steps or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms or invented entities are invoked; the work is an empirical prompting method whose assumptions (e.g., that models can usefully critique each other) are implicit and unstated in the abstract.

pith-pipeline@v0.9.0 · 5494 in / 941 out tokens · 41044 ms · 2026-05-12T02:56:58.178009+00:00 · methodology

Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 7.0

Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
cs.MA 2026-05 unverdicted novelty 7.0

External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
cs.AI 2026-05 unverdicted novelty 7.0

AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmar...
When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
cs.AI 2026-04 unverdicted novelty 7.0

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
cs.CR 2026-04 unverdicted novelty 7.0

Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
cs.AI 2026-04 unverdicted novelty 7.0

An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
Learning to Interrupt in Language-based Multi-agent Communication
cs.CL 2026-04 unverdicted novelty 7.0

HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...
Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
Measuring Faithfulness in Chain-of-Thought Reasoning
cs.AI 2023-07 conditional novelty 7.0

Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG 2026-05 unverdicted novelty 6.0

Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 6.0

Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
cs.AI 2026-05 unverdicted novelty 6.0

A multi-agent council of Gemini agents using absence-based clinical rules achieves F1 0.406 for defense mechanism classification, placing second among 64 teams, with overrides from fine-tuned models adding 2.4pp.
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
cs.AI 2026-05 unverdicted novelty 6.0

A deliberative council of Gemini agents using absence-based clinical rules achieves 0.382 F1 without fine-tuning and second place overall at 0.406 F1 on defense mechanism classification, with minority-class overrides ...
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration
cs.AI 2026-05 unverdicted novelty 6.0

Context injection in multi-agent design shows a crossover effect, improving exploration up to 20x on some tasks but reducing it by 46% on others, predicted by baseline exploration levels with r=-0.82.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI 2026-05 unverdicted novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
Pact: A Choreographic Language for Agentic Ecosystems
cs.PL 2026-05 unverdicted novelty 6.0

Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...
AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0

AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...
AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0

AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC 2026-04 unverdicted novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation
cs.MA 2026-04 unverdicted novelty 6.0

Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
cs.CR 2026-04 unverdicted novelty 6.0

Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
cs.AI 2026-04 unverdicted novelty 6.0

ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
cs.CR 2026-04 unverdicted novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
cs.CR 2026-04 unverdicted novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology
cs.AI 2026-03 accept novelty 6.0

A 7x6 matrix classifies AI agent patterns into 27 types by combining cognitive functions and execution topologies, yielding five empirical laws linking task constraints to architectural choices.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Large Language Models Cannot Self-Correct Reasoning Yet
cs.CL 2023-10 unverdicted novelty 6.0

LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
cs.CL 2023-08 conditional novelty 6.0

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations
cs.AI 2026-05 unverdicted novelty 5.0

Monte Carlo simulations of LLM agents confirm that toxic debates take 25% longer to converge, with larger delays in smaller models, and show a first-mover advantage independent of toxicity.
12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
cs.AI 2026-05 unverdicted novelty 5.0

Twelve LLM agents in a 12 Angry Men jury setup almost always end in hung juries due to anchoring, with Llama-4-Scout showing more vote changes than GPT-4o, suggesting RLHF alignment intensity limits deliberative flexibility.
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
cs.AI 2026-05 unverdicted novelty 5.0

Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...
Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems
cs.MA 2026-04 unverdicted novelty 5.0

MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.
Enhancing LLM Problem Solving via Tutor-Student Multi-Agent Interaction
cs.AI 2026-04 unverdicted novelty 5.0

A same-LLM tutor-student agent pair solves coding tasks at similar or higher accuracy than self-consistency or debate baselines while using significantly fewer tokens.
AIVV: Neuro-Symbolic LLM Agent-Integrated Verification and Validation for Trustworthy Autonomous Systems
cs.AI 2026-04 unverdicted novelty 5.0

AIVV deploys LLM agents in a council to semantically validate anomalies in time-series data against natural-language requirements, automating human-in-the-loop verification for autonomous systems.
Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
cs.LG 2026-03 unverdicted novelty 5.0

Language models display model-specific escalation thresholds in uncertain decisions that are not explained by scale or architecture, and supervised fine-tuning on explicit uncertainty reasoning produces robust, genera...
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
cs.AI 2026-03 unverdicted novelty 5.0

An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
cs.CL 2023-05 conditional novelty 5.0

Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
cs.LG 2026-04 unverdicted novelty 4.0

HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
cs.AI 2025-01 unverdicted novelty 4.0

The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and ident...
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
Latency and Cost of Multi-Agent Intelligent Tutoring at Scale
cs.CY 2026-04 unverdicted novelty 3.0

Priority PayGo keeps multi-agent tutoring responses under 4 seconds even at 50 concurrent users, while costs stay below textbook prices per student.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 45 Pith papers · 16 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: A visual language model for few-shot learning. NeurIPS, 2022. URL https://arxiv.org/abs/2204.14198. 9

work page internal anchor Pith review arXiv 2022
[2]

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Neural Information Processing Systems, 2017. 9

work page 2017
[3]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 5, 9

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Y . Du, S. Li, and I. Mordatch. Compositional visual generation with energy based models. In Advances in Neural Information Processing Systems, 2020. 9

work page 2020
[5]

Y . Du, C. Durkan, R. Strudel, J. B. Tenenbaum, S. Dieleman, R. Fergus, J. Sohl-Dickstein, A. Doucet, and W. Grathwohl. Reduce, reuse, recycle: Compositional generation with energy- based diffusion models and mcmc. arXiv preprint arXiv:2302.11552, 2023. 9

work page arXiv 2023
[6]

Fsmosca/pgn-standard: Portable game notation specification and implementation guide

Fsmosca. Fsmosca/pgn-standard: Portable game notation specification and implementation guide. URL https://github.com/fsmosca/PGN-Standard. 5

work page
[7]

K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020. 9

work page internal anchor Pith review arXiv 2002
[8]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. 7

work page internal anchor Pith review Pith/arXiv arXiv 2009
[9]

AI safety via debate

G. Irving, P. Christiano, and D. Amodei. Ai safety via debate. arXiv preprint arXiv:1805.00899,

work page internal anchor Pith review arXiv
[10]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. H. Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022. 7, 9

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Large Language Models are Zero-Shot Reasoners

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022. 2, 6, 9 10

work page internal anchor Pith review arXiv 2022
[12]

N. Lee, W. Ping, P. Xu, M. Patwary, P. N. Fung, M. Shoeybi, and B. Catanzaro. Factuality enhanced language models for open-ended text generation. Advances in Neural Information Processing Systems, 35:34586–34599, 2022. 9

work page 2022
[13]

Solving Quantitative Reasoning Problems with Language Models

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022. 9

work page internal anchor Pith review arXiv 2022
[14]

S. Li, Y . Du, J. B. Tenenbaum, A. Torralba, and I. Mordatch. Composing ensembles of pre-trained models via iterative consensus. arXiv preprint arXiv:2210.11522, 2022. 9

work page arXiv 2022
[15]

Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al. Competition-level code generation with alphacode. Science, 378 (6624):1092–1097, 2022. 5, 9

work page 2022
[16]

H. Liu, L. Lee, K. Lee, and P. Abbeel. Instruction-following agents with jointly pre-trained vision-language models. arXiv preprint arXiv:2210.13431, 2022. 9

work page arXiv 2022
[17]

N. Liu, S. Li, Y . Du, A. Torralba, and J. B. Tenenbaum. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022. 9

work page arXiv 2022
[18]

Self-Refine: Iterative Refinement with Self-Feedback

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y . Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023. 2, 5, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

M. Minsky. Society of mind. Simon and Schuster, 1988. 1

work page 1988
[20]

M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. 9

[21]


OpenAI. Chatgpt: Optimizing language models for dialogue, Dec 2022. URL https://openai.com/blog/chatgpt/. 2, 5

[22]


L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022. 4

[23]

S. Pichai. An important next step on our ai journey, Feb 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/. 2, 9

[24]

N. F. Rajani, B. McCann, C. Xiong, and R. Socher. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019. 9

[25]


L. Reynolds and K. McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7, 2021. 9

[26]


N. Shinn, B. Labash, and A. Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023. 2, 5, 9

[27]


A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. 7, 14

[28]


R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022. 9

[29]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. 9

[30]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. 9

[31]


E. Zelikman, Y. Wu, J. Mu, and N. Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022. 9

[32]

A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. URL https://arxiv.org/abs/2204.00598. 9

[33]

D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019. 9

A Appendix

In this appendix, we provide additional analysis and visualizations of the debates used in the main paper in Section A.1. We further provide experimental details on each dataset in Section A.2.

A.1 Additional Results

[Figure: Consensus vs Number of Debating Agents. x-axis: Debate Rounds; y-axis: Consensus.] Shor...


<XXX> and make sure the chess move is valid in the current board state.

Biographies

Starting: Give a bullet point biography of <person> highlighting their contributions and achievements as a computer scientist, with each fact separated with a new line character.

Debate: Here are some bullet point biographies of <person> given by other agents: <other agent response> ...
