Improving Factuality and Reasoning in Language Models through Multiagent Debate

Antonio Torralba; Igor Mordatch; Joshua B. Tenenbaum; Shuang Li; Yilun Du

arXiv: 2305.14325 · v1 · submitted 2023-05-23 · cs.CL · cs.AI · cs.CV · cs.LG

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch

Pith reviewed 2026-05-12 02:56 UTC · model grok-4.3

keywords: multiagent debate · language models · reasoning · factuality · hallucinations · prompting · multi-round discussion

The pith

Multiple language model instances improve their answers by debating proposals and reasoning over multiple rounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce a method in which several instances of a language model each propose an answer and then debate their reasoning over successive rounds to arrive at a common final response. The paper reports stronger results on mathematical and strategic reasoning tasks, along with fewer fabricated details and logical slips. The technique matters because it applies to off-the-shelf, black-box models through text prompts alone, with no change to the underlying system. It points to a path for making AI-generated information more reliable through interaction rather than isolated generation. The same procedure and prompts are used for every task the authors investigate.

Core claim

The central discovery is that a multiagent debate setup, in which distinct language model copies propose individual responses and then engage in iterative exchanges of arguments and critiques, produces a consensus answer that outperforms standard single-model outputs in mathematical and strategic reasoning tasks while also increasing the factual correctness of the content.

What carries the argument

The multi-round debate mechanism among multiple LLM instances that allows proposal, critique, and convergence on a final answer.
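To make the mechanism concrete, here is a minimal Python sketch of a debate loop of this kind. It is an illustration under stated assumptions, not the authors' code: the `Model` callable, the `debate` function, and the prompt wording are all hypothetical, and a real system would substitute its own model client and prompts.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption): prompt string in, response string out.
Model = Callable[[str], str]


def debate(question: str, agents: List[Model], rounds: int = 2) -> List[str]:
    """Run a multi-round debate and return each agent's final response."""
    # Initial step: every agent answers the question independently.
    responses = [agent(question) for agent in agents]

    for _ in range(rounds):
        updated = []
        for i, agent in enumerate(agents):
            # Show agent i the latest responses of all the other agents.
            others = [r for j, r in enumerate(responses) if j != i]
            # Illustrative prompt wording only; not taken from the paper.
            prompt = (
                f"Question: {question}\n"
                "Other agents answered:\n"
                + "\n".join(f"- {r}" for r in others)
                + "\nCritique these answers and give your updated answer."
            )
            updated.append(agent(prompt))
        # All agents update simultaneously from the previous round.
        responses = updated

    return responses
```

Passing `rounds=0` reduces the loop to independent answers, which gives a natural no-debate reference point for the same set of agents.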

If this is right

Improved performance on mathematical reasoning tasks.
Better results in strategic reasoning scenarios.
Reduced incidence of incorrect factual statements and hallucinations.
Usable on any existing language model through prompting alone.
Same method works across different tasks without customization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The debate format might help in domains requiring creative problem solving by simulating diverse perspectives.
It suggests that error correction can emerge from interaction even if individual models share biases.
Further tests could examine whether the benefits persist when models are from different families or sizes.
This could inform designs for AI systems that incorporate internal deliberation steps.

Load-bearing premise

Debate among the models converges on the truth instead of creating agreement around a common mistake.

What would settle it

Running the debate process on a set of questions where all models start with the same wrong answer and checking if they still output that wrong answer at the end.
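One way to score that experiment is sketched below in Python. The function name `shared_error_persistence` and the data layout (one list of first-round answers per question, plus the final answer and the gold answer) are assumptions made for illustration; the paper does not define such a metric.

```python
from typing import List, Sequence


def shared_error_persistence(
    initial: Sequence[List[str]],
    final: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Among questions where every agent started with the same wrong answer,
    return the fraction whose final answer is still that wrong answer."""
    eligible = 0
    persisted = 0
    for first_round, last, truth in zip(initial, final, gold):
        unanimous = len(set(first_round)) == 1
        # Only questions with a unanimous, incorrect start are counted.
        if unanimous and first_round[0] != truth:
            eligible += 1
            if last == first_round[0]:
                persisted += 1
    if eligible == 0:
        raise ValueError("no question had a unanimous wrong start")
    return persisted / eligible
```

A value near 1.0 would suggest that debate entrenches shared mistakes; a value well below 1.0 would suggest that the exchange can dislodge them.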

Original abstract

Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a multi-agent debate framework in which multiple instances of an LLM generate initial answers and reasoning, then iteratively critique and refine each other's outputs over several rounds before converging on a final consensus answer. The method is evaluated on mathematical reasoning (e.g., GSM8K-style problems), strategic reasoning tasks, and factuality benchmarks, with the central claim that this 'society of minds' interaction yields substantial gains in accuracy and reduced hallucinations relative to standard single-prompt or few-shot baselines, using identical procedures across black-box models.

Significance. If the reported gains prove robust, the work offers a practical, training-free prompting technique that leverages inter-agent interaction to improve reasoning and factuality beyond what independent sampling provides. It extends prior ideas such as self-consistency and verification by introducing explicit debate, and its applicability to existing models without internal access or parameter changes makes it potentially impactful for real-world LLM deployment.

major comments (3)

[§4 and Table 2] §4 (Experimental Setup) and Table 2: the reported accuracy improvements on reasoning tasks are not accompanied by direct comparisons to strong baselines such as self-consistency sampling or majority vote over an equivalent number of independent generations; without these, it is impossible to determine whether the iterative debate supplies corrective signal beyond increased sampling.
[§5.2] §5.2 (Factuality Experiments): the evaluation of hallucination reduction relies on automatic metrics and human judgments whose inter-annotator agreement and statistical significance are not reported; this weakens the claim that debate specifically reduces fallacious answers rather than simply producing more plausible consensus text.
[§3.1] §3.1 (Debate Protocol): the description does not specify the exact prompt templates used for critique rounds or the tie-breaking rule when agents fail to reach consensus after the final round; these details are load-bearing for reproducibility and for isolating whether gains arise from the interaction itself.

minor comments (2)

[Abstract and §1] The abstract and introduction use the phrase 'significantly enhances' without defining the threshold or providing effect sizes; this should be qualified with reference to the specific tables.
[Figure 1] Figure 1 (debate diagram) would benefit from an example trace showing an actual correction that occurs across rounds rather than a schematic only.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and rigor of the manuscript. We address each major comment below and have revised the paper accordingly to strengthen the presentation of our results and methods.

Referee: [§4 and Table 2] §4 (Experimental Setup) and Table 2: the reported accuracy improvements on reasoning tasks are not accompanied by direct comparisons to strong baselines such as self-consistency sampling or majority vote over an equivalent number of independent generations; without these, it is impossible to determine whether the iterative debate supplies corrective signal beyond increased sampling.

Authors: We agree that direct comparisons to self-consistency (Wang et al. 2023) and majority voting over an equivalent number of independent generations are necessary to isolate the benefit of iterative critique. In the revised manuscript we have added these baselines to Table 2 and §4, using the same total number of model calls as our debate setup (e.g., 3 or 5 generations). The new results show that multi-agent debate still yields statistically significant gains over both self-consistency and majority vote on GSM8K and strategic reasoning tasks, indicating that the corrective signal arises from the interaction rather than sampling alone. We have also clarified the experimental controls in §4.
revision: yes
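The compute-matched baseline described in this exchange can be sketched in Python as follows. The accounting rule (one model call per agent per round, with the initial answers counted as the first round) and the names `debate_call_budget` and `majority_vote_baseline` are assumptions for illustration, not details taken from the manuscript.

```python
from collections import Counter
from typing import Callable, List


def debate_call_budget(num_agents: int, num_rounds: int) -> int:
    """Model calls used by a debate, ASSUMING one call per agent per round
    and counting the initial independent answers as the first round."""
    return num_agents * num_rounds


def majority_vote_baseline(sample: Callable[[], str], budget: int) -> str:
    """Draw `budget` independent answers and return the most common one.

    `sample` is a hypothetical zero-argument function that queries the model
    once and returns a normalized answer string.
    """
    answers: List[str] = [sample() for _ in range(budget)]
    return Counter(answers).most_common(1)[0][0]
```

Holding the number of calls fixed is what lets a gap between debate and voting be attributed to the interaction rather than to extra samples.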
Referee: [§5.2] §5.2 (Factuality Experiments): the evaluation of hallucination reduction relies on automatic metrics and human judgments whose inter-annotator agreement and statistical significance are not reported; this weakens the claim that debate specifically reduces fallacious answers rather than simply producing more plausible consensus text.

Authors: We acknowledge that reporting inter-annotator agreement and statistical significance is essential. In the revision we have added Cohen’s kappa scores for the human factuality annotations (reported in the new Table 5) and performed McNemar’s tests to establish statistical significance of the hallucination reduction. We also clarify that the automatic metrics are drawn from TruthfulQA and that the debate protocol explicitly prompts agents to critique factual errors, not merely to produce fluent text. These additions are now in §5.2.
revision: yes
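For readers unfamiliar with the two statistics named in this response, the sketch below computes them in plain Python from their standard definitions. It is not the authors' analysis code: the function names are hypothetical, and the exact (binomial) form of McNemar's test is one common variant chosen here for simplicity.

```python
from collections import Counter
from math import comb
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when chance agreement equals 1, i.e. when
    both annotators use one identical label for every item.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def mcnemar_exact_p(only_first_correct: int, only_second_correct: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts
    (items that exactly one of the two paired systems got right)."""
    n = only_first_correct + only_second_correct
    k = min(only_first_correct, only_second_correct)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

McNemar's test suits this setting because debate and the baseline are scored on the same questions, so only the items where the two disagree carry information.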
Referee: [§3.1] §3.1 (Debate Protocol): the description does not specify the exact prompt templates used for critique rounds or the tie-breaking rule when agents fail to reach consensus after the final round; these details are load-bearing for reproducibility and for isolating whether gains arise from the interaction itself.

Authors: We thank the referee for highlighting this reproducibility gap. The exact prompt templates for the critique rounds have been added to Appendix A. For the tie-breaking rule, when agents do not converge after the maximum number of rounds we fall back to majority vote over the final responses, with a uniform random choice in the event of a tie; this procedure is now stated explicitly in §3.1. These changes allow readers to replicate the interaction dynamics precisely.
revision: yes
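The fallback rule stated in this response is simple enough to sketch directly. The Python below is an illustrative reading of it, with a hypothetical function name `resolve_final_answer` and an optional seeded generator added for reproducibility; it assumes answers have already been normalized to comparable strings.

```python
import random
from collections import Counter
from typing import Optional, Sequence


def resolve_final_answer(
    answers: Sequence[str], rng: Optional[random.Random] = None
) -> str:
    """Majority vote over the agents' final answers, with a uniform random
    choice among the top answers when the vote is tied."""
    rng = rng or random.Random()
    counts = Counter(answers)
    top = max(counts.values())
    # Sort the tied answers so the candidate order does not depend on agent order.
    tied = sorted(answer for answer, count in counts.items() if count == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

Passing a seeded `random.Random` makes a tied outcome repeatable across runs, which matters when the tie-break can change a reported accuracy figure.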

Circularity Check

0 steps flagged

No significant circularity: empirical prompting method with independent experimental support

The paper introduces a multi-agent debate prompting procedure for LLMs and evaluates it empirically on math, strategy, and factuality tasks. No equations, derivations, fitted parameters, or ansatzes are present. Claims rest on reported performance gains from black-box model experiments rather than any internal reduction to inputs or self-citation chains. The approach is self-contained against external benchmarks, with no load-bearing self-definitional steps or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms or invented entities are invoked; the work is an empirical prompting method whose assumptions (e.g., that models can usefully critique each other) are implicit and unstated in the abstract.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
quant-ph 2025-10 accept novelty 8.0 full

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructio...
Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
cs.HC 2024-05 conditional novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences acros...
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
cs.AI 2026-05 unverdicted novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
cs.LG 2026-05 unverdicted novelty 7.0

Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identificati...
Test-Time Hinting for Black-Box Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 7.0

Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
cs.MA 2026-05 unverdicted novelty 7.0

External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
cs.AI 2026-05 unverdicted novelty 7.0

AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmar...
When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
cs.AI 2026-04 unverdicted novelty 7.0

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
cs.CR 2026-04 unverdicted novelty 7.0

Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
cs.AI 2026-04 unverdicted novelty 7.0

An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
Learning to Interrupt in Language-based Multi-agent Communication
cs.CL 2026-04 unverdicted novelty 7.0

HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...
Multi-Modal Manipulation via Multi-Modal Policy Consensus
cs.RO 2025-09 unverdicted novelty 7.0

A policy that factorizes into modality-specific diffusion models combined by a learned router network for adaptive multi-modal robotic manipulation.
Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
cs.CL 2024-05 unverdicted novelty 7.0

Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
cs.CL 2023-08 conditional novelty 7.0

AgentVerse enables dynamic multi-agent collaboration among LLM agents to outperform single agents while revealing emergent social behaviors during task completion.
Measuring Faithfulness in Chain-of-Thought Reasoning
cs.AI 2023-07 conditional novelty 7.0

Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation
cs.CL 2026-05 conditional novelty 6.0

NewsLens is a five-agent LLM pipeline that generates framing maps from news articles to expose ideological omissions and manipulation across geopolitical topics.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG 2026-05 unverdicted novelty 6.0

Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG 2026-05 unverdicted novelty 6.0

Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 po...
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 6.0

Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL 2026-05 unverdicted novelty 6.0

DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL 2026-05 conditional novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, wit...
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
cs.AI 2026-05 unverdicted novelty 6.0

A deliberative council of Gemini agents using absence-based clinical rules achieves 0.382 F1 without fine-tuning and second place overall at 0.406 F1 on defense mechanism classification, with minority-class overrides ...
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
cs.AI 2026-05 unverdicted novelty 6.0

A multi-agent council of Gemini agents using absence-based clinical rules achieves F1 0.406 for defense mechanism classification, placing second among 64 teams, with overrides from fine-tuned models adding 2.4pp.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration
cs.AI 2026-05 unverdicted novelty 6.0

Context injection in multi-agent design shows a crossover effect, improving exploration up to 20x on some tasks but reducing it by 46% on others, predicted by baseline exploration levels with r=-0.82.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI 2026-05 unverdicted novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
Pact: A Choreographic Language for Agentic Ecosystems
cs.PL 2026-05 unverdicted novelty 6.0

Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...
AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0

AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...
AI Alignment via Incentives and Correction
cs.LG 2026-05 unverdicted novelty 6.0

AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC 2026-04 unverdicted novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation
cs.MA 2026-04 unverdicted novelty 6.0

Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
cs.CR 2026-04 unverdicted novelty 6.0

Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
cs.AI 2026-04 unverdicted novelty 6.0

ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
cs.CR 2026-04 unverdicted novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
cs.CR 2026-04 unverdicted novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology
cs.AI 2026-03 accept novelty 6.0

A 7x6 matrix classifies AI agent patterns into 27 types by combining cognitive functions and execution topologies, yielding five empirical laws linking task constraints to architectural choices.
Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
cs.AI 2026-01 unverdicted novelty 6.0

Multi-agent actor-critic methods with a centralized critic improve decentralized LLM collaboration over Monte Carlo baselines in long-horizon and sparse-reward settings.
World model inspired sarcasm reasoning with large language model agents
cs.CL 2025-12 unverdicted novelty 6.0

WM-SAR decomposes sarcasm into LLM-agent components, quantifies literal-normative inconsistency deterministically, and integrates it with intention via logistic regression to outperform prior sarcasm detectors on benchmarks.
Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
cs.CL 2025-10 unverdicted novelty 6.0

GTD generates task-adaptive, sparse communication topologies for multi-LLM agents via guided iterative graph diffusion steered by a proxy model predicting accuracy, utility, and cost.
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
cs.AI 2025-10 unverdicted novelty 6.0

ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
cs.CL 2025-09 unverdicted novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
cs.AI 2025-07 unverdicted novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming pri...
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
cs.AI 2025-07 conditional novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
RankFlow: A Multi-Role Collaborative Reranking Workflow Utilizing Large Language Models
cs.IR 2025-02 unverdicted novelty 6.0

RankFlow deploys four LLM roles in sequence to rewrite queries, generate pseudo-answers, summarize passages, and rerank candidates, outperforming prior methods on TREC-DL, BEIR, and NovelEval.
Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Mixture-of-Agents Enhances Large Language Model Capabilities
cs.CL 2024-06 unverdicted novelty 6.0

A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration
cs.CL 2023-10 conditional novelty 6.0

DyLAN automatically selects and dynamically organizes LLM agents for collaboration, outperforming fixed-agent baselines on code generation, reasoning, and decision tasks with up to 25% accuracy gains on some MMLU subjects.
Large Language Models Cannot Self-Correct Reasoning Yet
cs.CL 2023-10 unverdicted novelty 6.0

LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
cs.CL 2023-09 conditional novelty 6.0

DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
Cognitive Architectures for Language Agents
cs.AI 2023-09 accept novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...
A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 78 Pith papers · 17 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: A visual language model for few-shot learning. NeurIPS, 2022. URL https://arxiv.org/abs/2204.14198.

[2]

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Neural Information Processing Systems, 2017.

[3]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[4]

Y. Du, S. Li, and I. Mordatch. Compositional visual generation with energy based models. In Advances in Neural Information Processing Systems, 2020. 9

work page 2020
[5]

Y. Du, C. Durkan, R. Strudel, J. B. Tenenbaum, S. Dieleman, R. Fergus, J. Sohl-Dickstein, A. Doucet, and W. Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. arXiv preprint arXiv:2302.11552, 2023. 9

work page arXiv 2023
[6]

Fsmosca/pgn-standard: Portable game notation specification and implementation guide

Fsmosca. Fsmosca/pgn-standard: Portable game notation specification and implementation guide. URL https://github.com/fsmosca/PGN-Standard. 5

work page
[7]

K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020. 9

work page internal anchor Pith review arXiv 2020
[8]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. 7

work page internal anchor Pith review Pith/arXiv arXiv 2020
[9]

AI safety via debate

G. Irving, P. Christiano, and D. Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018.

work page internal anchor Pith review arXiv
[10]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. H. Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022. 7, 9

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Large Language Models are Zero-Shot Reasoners

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022. 2, 6, 9

work page internal anchor Pith review arXiv 2022
[12]

N. Lee, W. Ping, P. Xu, M. Patwary, P. N. Fung, M. Shoeybi, and B. Catanzaro. Factuality enhanced language models for open-ended text generation. Advances in Neural Information Processing Systems, 35:34586–34599, 2022. 9

work page 2022
[13]

Solving Quantitative Reasoning Problems with Language Models

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022. 9

work page internal anchor Pith review arXiv 2022
[14]

S. Li, Y. Du, J. B. Tenenbaum, A. Torralba, and I. Mordatch. Composing ensembles of pre-trained models via iterative consensus. arXiv preprint arXiv:2210.11522, 2022. 9

work page arXiv 2022
[15]

Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022. 5, 9

work page 2022
[16]

H. Liu, L. Lee, K. Lee, and P. Abbeel. Instruction-following agents with jointly pre-trained vision-language models. arXiv preprint arXiv:2210.13431, 2022. 9

work page arXiv 2022
[17]

N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022. 9

work page arXiv 2022
[18]

Self-Refine: Iterative Refinement with Self-Feedback

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023. 2, 5, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

M. Minsky. Society of mind. Simon and Schuster, 1988. 1

work page 1988
[20]

M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. 9

work page internal anchor Pith review arXiv 2021
[21]

ChatGPT: Optimizing language models for dialogue, Dec 2022

OpenAI. ChatGPT: Optimizing language models for dialogue, Dec 2022. URL https://openai.com/blog/chatgpt/. 2, 5

work page 2022
[22]

Training language models to follow instructions with human feedback

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022. 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

S. Pichai. An important next step on our AI journey, Feb 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/. 2, 9

work page 2023
[24]

N. F. Rajani, B. McCann, C. Xiong, and R. Socher. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019. 9

work page Pith review arXiv 2019
[25]

Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm

L. Reynolds and K. McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7, 2021. 9

work page 2021
[26]

Reflexion: Language Agents with Verbal Reinforcement Learning

N. Shinn, B. Labash, and A. Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023. 2, 5, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022. 7, 14

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

LaMDA: Language Models for Dialog Applications

R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022. 9

work page Pith review arXiv 2022
[29]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. 9

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. 9

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

STaR: Bootstrapping Reasoning With Reasoning

E. Zelikman, Y. Wu, J. Mu, and N. Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022. 9

work page 2022
[32]

A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. URL https://arxiv.org/abs/2204.00598. 9

work page internal anchor Pith review arXiv 2022
[33]

D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

work page internal anchor Pith review Pith/arXiv arXiv 2019
[34]

We further provide detailed experimental details on each dataset in Section A.2

A Appendix. In this appendix, we provide additional analysis and visualizations of the debates used in the main paper in Section A.1. We further provide detailed experimental details on each dataset in Section A.2. A.1 Additional Results. [Figure: "Consensus vs Number of Debating Agents"; axis labels: Debate Rounds, Consensus.] Shor...

work page
[35]

The Unix System,

<XXX> and make sure the chess move is valid in the current board state.
Biographies (Starting): Give a bullet point biography of <person> highlighting their contributions and achievements as a computer scientist, with each fact separated with a new line character.
Biographies (Debate): Here are some bullet point biographies of <person> given by other agents: <other agent response> ...

work page 1999
