hub Canonical reference

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch · 2023 · cs.CL · arXiv 2305.14325

Canonical reference. 86% of citing Pith papers cite this work as background.

78 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 78 citing papers arXiv PDF

abstract

Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 20 baseline 1 method 1

citation-polarity summary

background 19 baseline 1 support 1 use method 1

claims ledger

abstract Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings

co-cited works

representative citing papers

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.

Test-Time Hinting for Black-Box Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

cs.MA · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.

AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmark transfer.

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.

Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.

An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.

Learning to Interrupt in Language-based Multi-agent Communication

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.

Multi-Modal Manipulation via Multi-Modal Policy Consensus

cs.RO · 2025-09-27 · unverdicted · novelty 7.0

A policy that factorizes into modality-specific diffusion models combined by a learned router network for adaptive multi-modal robotic manipulation.

Automated Design of Agentic Systems

cs.AI · 2024-08-15 · conditional · novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

cs.CL · 2024-05-29 · unverdicted · novelty 7.0

Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

cs.CL · 2023-08-21 · conditional · novelty 7.0

AgentVerse enables dynamic multi-agent collaboration among LLM agents to outperform single agents while revealing emergent social behaviors during task completion.

Measuring Faithfulness in Chain-of-Thought Reasoning

cs.AI · 2023-07-17 · conditional · novelty 7.0

Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.

NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation

cs.CL · 2026-05-17 · conditional · novelty 6.0

NewsLens is a five-agent LLM pipeline that generates framing maps from news articles to expose ideological omissions and manipulation across geopolitical topics.

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.

citing papers explorer

Showing 50 of 78 citing papers.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems quant-ph · 2025-10-23 · accept · full · ref 20 · internal anchor
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 16 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments cs.HC · 2024-05-13 · conditional · none · ref 5 · internal anchor
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 10 · internal anchor
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 9 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels cs.LG · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.
Test-Time Hinting for Black-Box Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies cs.MA · 2026-05-12 · unverdicted · none · ref 5 · 2 links · internal anchor
Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics cond-mat.stat-mech · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design cs.MA · 2026-05-09 · unverdicted · none · ref 19 · internal anchor
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization cs.AI · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmark transfer.
When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning cs.AI · 2026-05-07 · unverdicted · none · ref 13 · internal anchor
Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate cs.CL · 2026-05-02 · unverdicted · none · ref 11 · internal anchor
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs cs.AI · 2026-04-25 · unverdicted · none · ref 5 · internal anchor
GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery cs.CR · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks cs.AI · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
Learning to Interrupt in Language-based Multi-agent Communication cs.CL · 2026-04-07 · unverdicted · none · ref 9 · internal anchor
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.
Multi-Modal Manipulation via Multi-Modal Policy Consensus cs.RO · 2025-09-27 · unverdicted · none · ref 29 · internal anchor
A policy that factorizes into modality-specific diffusion models combined by a learned router network for adaptive multi-modal robotic manipulation.
Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 150 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions cs.CL · 2024-05-29 · unverdicted · none · ref 68 · internal anchor
Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors cs.CL · 2023-08-21 · conditional · none · ref 1 · internal anchor
AgentVerse enables dynamic multi-agent collaboration among LLM agents to outperform single agents while revealing emergent social behaviors during task completion.
Measuring Faithfulness in Chain-of-Thought Reasoning cs.AI · 2023-07-17 · conditional · none · ref 7 · internal anchor
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation cs.CL · 2026-05-17 · conditional · none · ref 6 · internal anchor
NewsLens is a five-agent LLM pipeline that generates framing maps from news articles to expose ideological omissions and manipulation across geopolitical topics.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy cs.LG · 2026-05-13 · unverdicted · none · ref 10 · 2 links · internal anchor
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement cs.CL · 2026-05-11 · conditional · none · ref 7 · internal anchor
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification cs.AI · 2026-05-10 · unverdicted · none · ref 6 · 2 links · internal anchor
A deliberative council of Gemini agents using absence-based clinical rules achieves 0.382 F1 without fine-tuning and second place overall at 0.406 F1 on defense mechanism classification, with minority-class overrides adding 2.4pp.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems cs.AI · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.
When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration cs.AI · 2026-05-05 · unverdicted · none · ref 2 · internal anchor
Context injection in multi-agent design shows a crossover effect, improving exploration up to 20x on some tasks but reducing it by 46% on others, predicted by baseline exploration levels with r=-0.82.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA cs.AI · 2026-05-05 · unverdicted · none · ref 26 · internal anchor
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
Pact: A Choreographic Language for Agentic Ecosystems cs.PL · 2026-05-04 · unverdicted · none · ref 12 · internal anchor
Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning cs.CL · 2026-05-03 · unverdicted · none · ref 2 · internal anchor
Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact and FEVER.
AI Alignment via Incentives and Correction cs.LG · 2026-05-02 · unverdicted · none · ref 19 · 2 links · internal anchor
AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM coding tasks.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness q-bio.NC · 2026-04-30 · unverdicted · none · ref 7 · internal anchor
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation cs.MA · 2026-04-29 · unverdicted · none · ref 8 · internal anchor
Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation cs.CR · 2026-04-24 · unverdicted · none · ref 38 · internal anchor
Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 6 · internal anchor
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training cs.CR · 2026-04-09 · unverdicted · none · ref 14 · internal anchor
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks cs.CR · 2026-04-05 · unverdicted · none · ref 6 · internal anchor
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
World model inspired sarcasm reasoning with large language model agents cs.CL · 2025-12-30 · unverdicted · none · ref 26 · internal anchor
WM-SAR decomposes sarcasm into LLM-agent components, quantifies literal-normative inconsistency deterministically, and integrates it with intention via logistic regression to outperform prior sarcasm detectors on benchmarks.
Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models cs.CL · 2025-10-09 · unverdicted · none · ref 6 · 2 links · internal anchor
GTD generates task-adaptive, sparse communication topologies for multi-LLM agents via guided iterative graph diffusion steered by a proxy model predicting accuracy, utility, and cost.
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems cs.AI · 2025-10-07 · unverdicted · none · ref 6 · internal anchor
ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 141 · internal anchor
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 30 · internal anchor
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 113 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
RankFlow: A Multi-Role Collaborative Reranking Workflow Utilizing Large Language Models cs.IR · 2025-02-02 · unverdicted · none · ref 16 · internal anchor
RankFlow deploys four LLM roles in sequence to rewrite queries, generate pseudo-answers, summarize passages, and rerank candidates, outperforming prior methods on TREC-DL, BEIR, and NovelEval.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 173 · internal anchor
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Mixture-of-Agents Enhances Large Language Model Capabilities cs.CL · 2024-06-07 · unverdicted · none · ref 6 · internal anchor
A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 151 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration cs.CL · 2023-10-03 · conditional · none · ref 6 · internal anchor
DyLAN automatically selects and dynamically organizes LLM agents for collaboration, outperforming fixed-agent baselines on code generation, reasoning, and decision tasks with up to 25% accuracy gains on some MMLU subjects.
Large Language Models Cannot Self-Correct Reasoning Yet cs.CL · 2023-10-03 · unverdicted · none · ref 5 · internal anchor
LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.

Improving Factuality and Reasoning in Language Models through Multiagent Debate

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer