ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan; Jianxuan Yu; Jie Fu; Shanghang Zhang; Wei Xue; Weize Chen; Yusheng Su; Zhiyuan Liu

arxiv: 2308.07201 · v1 · submitted 2023-08-14 · 💻 cs.CL

Pith reviewed 2026-05-13 12:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords multi-agent debate · LLM evaluation · text assessment · NLG tasks · automated evaluation · ChatEval · human-mimicking evaluation · open-ended questions

The pith

A multi-agent team of LLMs debates to evaluate generated text with human-like reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that single large language models fall short of human evaluators when scoring the quality of text outputs. It introduces a multi-agent debate framework in which several LLMs discuss and critique responses to open-ended questions and standard NLG tasks. The approach draws on the practice of human evaluation panels that use multiple annotators for better consensus. If the method works, automated evaluation becomes more consistent and less reliant on expensive human labor while handling nuanced judgments. Experiments show the resulting system, ChatEval, produces assessments that better align with human standards than isolated model scoring.

Core claim

ChatEval constructs a multi-agent referee team that allows a group of LLMs to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation tasks, transcending mere textual scoring to offer a human-mimicking evaluation process for reliable assessments.

What carries the argument

The multi-agent referee team that lets distinct LLMs exchange views and reach consensus on response quality.
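
As a rough illustration of how such a referee team could exchange views, here is a minimal Python sketch of a debate loop. It is not the authors' implementation: the `Referee` class, the `run_debate` function, the role labels, the "every referee speaks once per round" schedule, and the mean-of-last-round aggregation are all assumptions made for this example, and `stub_turn` is a deterministic stand-in for a real LLM call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# A turn takes (role, question, answer, transcript so far) and returns
# (comment, score). In a real system this would wrap an LLM call.
TurnFn = Callable[[str, str, str, List[str]], Tuple[str, float]]


@dataclass
class Referee:
    role: str            # hypothetical persona label, e.g. "critic"
    take_turn: TurnFn


def run_debate(referees: List[Referee], question: str, answer: str,
               rounds: int = 2) -> Tuple[float, List[str]]:
    """Each referee speaks once per round and sees everything said so far.

    The final score is the mean of the last-round scores; this aggregation
    rule is an assumption of the sketch, not something the page specifies.
    """
    transcript: List[str] = []
    last_scores: List[float] = []
    for _ in range(rounds):
        last_scores = []
        for ref in referees:
            comment, score = ref.take_turn(ref.role, question, answer, transcript)
            transcript.append(f"{ref.role}: {comment} (score={score})")
            last_scores.append(score)
    return mean(last_scores), transcript


def stub_turn(role: str, question: str, answer: str,
              transcript: List[str]) -> Tuple[str, float]:
    """Deterministic stand-in for an LLM: one point per word, capped at 10."""
    score = float(min(10, len(answer.split())))
    return f"seen {len(transcript)} prior turns", score


if __name__ == "__main__":
    team = [Referee("critic", stub_turn), Referee("general public", stub_turn)]
    final, log = run_debate(team, "What is 2+2?", "The answer is four.")
    print(final)
    for line in log:
        print(line)
```

The shared `transcript` is the only channel between referees, which is what distinguishes this loop from simply sampling the same model several times in isolation.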

If this is right

Evaluations on open-ended questions become more reliable without added human annotators.
Standard NLG tasks receive assessments that capture subtleties single models often miss.
The framework scales to intricate tasks by combining multiple models' strengths.
Labor and time costs for large-scale text evaluation drop while consistency rises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The debate logs could serve as richer training signals for improving the evaluated models themselves.
Similar multi-agent structures might transfer to other LLM workflows such as planning or verification.
Optimal team composition and discussion length remain open parameters that future runs could tune.

Load-bearing premise

Performance gains come from genuine collaboration among the agents rather than from simply making more calls to one model or using better single-prompt instructions.
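
One way to probe this premise is to give a single model the same number of calls the debate consumes and compare the two. The sketch below only does the bookkeeping. It assumes a debate in which every agent speaks once per round; `debate_call_budget`, `matched_single_agent_score`, and `alternating_stub` are hypothetical names introduced here, and the stub replaces a real model call.

```python
from statistics import mean
from typing import Callable

# (answer, call_index) -> score; stands in for one independent LLM call.
ScoreFn = Callable[[str, int], float]


def debate_call_budget(num_agents: int, num_rounds: int) -> int:
    """Calls used by a debate where every agent speaks once per round."""
    return num_agents * num_rounds


def matched_single_agent_score(score_once: ScoreFn, answer: str,
                               budget: int) -> float:
    """Spend the same number of calls on one model and average its scores."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return mean(score_once(answer, i) for i in range(budget))


def alternating_stub(answer: str, call_index: int) -> float:
    """Deterministic pseudo-noise: alternates +1 / -1 around a base of 5."""
    return 5.0 + (1.0 if call_index % 2 == 0 else -1.0)


if __name__ == "__main__":
    budget = debate_call_budget(num_agents=3, num_rounds=2)
    print(budget, matched_single_agent_score(alternating_stub, "some answer", budget))
```

If the debate's agreement with human judgments is no better than this matched baseline, the extra structure is not what drives the gain.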

What would settle it

A controlled test in which one LLM receives the full transcript of the multi-agent discussion and produces evaluation scores that match or exceed the team's accuracy against human judgments.

Original abstract

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ChatEval, a multi-agent debate framework in which multiple LLMs assume distinct referee roles, discuss, and collectively evaluate the quality of responses generated by other models on open-ended questions and standard NLG tasks. It claims that this collaborative process yields more reliable, human-mimicking assessments than conventional single-agent LLM prompting, with supporting experiments and publicly released code.

Significance. If the reported gains can be shown to arise specifically from agent interaction and role differentiation rather than from increased total inference budget, the work would offer a practical, scalable method for automated evaluation that reduces dependence on human annotators while preserving reliability. The public code release is a clear strength for reproducibility and follow-up research.

major comments (2)

[Section 4] Experimental setup (Section 4): No control condition equates total LLM calls or token budget between ChatEval and single-agent baselines. A single-agent variant that issues the same number of sequential or repeated calls (with concatenated history) is required to isolate whether improvements stem from multi-agent debate structure rather than simply aggregating more model outputs.
[Section 4] Results and analysis: The abstract asserts that ChatEval provides superior human-mimicking evaluation, yet the description supplies no concrete quantitative metrics (e.g., Pearson/Spearman correlation with human judgments, win rates, or statistical significance tests) or explicit single-agent baselines with matched compute. This leaves the central superiority claim only moderately supported.
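
For readers unfamiliar with the correlation metrics named above, the following self-contained Python sketch computes Pearson and Spearman correlation between evaluator scores and human scores. It uses only the standard library; the function names are chosen for this example, and the data in the usage block is made up for illustration, not taken from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    human = [1.0, 2.0, 3.0, 4.0, 5.0]        # illustrative values only
    evaluator = [2.0, 1.0, 4.0, 3.0, 5.0]    # illustrative values only
    print(pearson(human, evaluator), spearman(human, evaluator))
```

Spearman depends only on the ordering of scores, so it is the more forgiving choice when an evaluator uses the scale differently from human annotators but ranks responses similarly.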

minor comments (2)

[Section 3] The roles and interaction protocol of the referee team are described at a high level; a concise diagram or pseudocode of one debate round would improve clarity of the multi-agent mechanism.
[Section 2] Related-work discussion could more explicitly contrast ChatEval with prior multi-agent LLM frameworks (e.g., those using debate for reasoning) to highlight the novelty of the evaluation-specific application.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our experimental analysis. We address each major comment below.

Referee: [Section 4] Experimental setup (Section 4): No control condition equates total LLM calls or token budget between ChatEval and single-agent baselines. A single-agent variant that issues the same number of sequential or repeated calls (with concatenated history) is required to isolate whether improvements stem from multi-agent debate structure rather than simply aggregating more model outputs.

Authors: We acknowledge the importance of controlling for the total computational budget to ensure that the observed improvements are due to the multi-agent debate mechanism rather than increased inference calls. In the revised manuscript, we introduce a new single-agent baseline that performs an equivalent number of sequential LLM calls with concatenated history. Our updated experiments demonstrate that ChatEval maintains superior performance even under this matched-budget condition, thereby strengthening the evidence for the benefits of multi-agent interaction.
revision: yes

Referee: [Section 4] Results and analysis: The abstract asserts that ChatEval provides superior human-mimicking evaluation, yet the description supplies no concrete quantitative metrics (e.g., Pearson/Spearman correlation with human judgments, win rates, or statistical significance tests) or explicit single-agent baselines with matched compute. This leaves the central superiority claim only moderately supported.

Authors: We appreciate this feedback on the presentation of results. While the full manuscript in Section 4 does include quantitative comparisons and human correlation metrics, we have revised the abstract to explicitly state key quantitative findings, including Pearson and Spearman correlations with human judgments, win rates against baselines, and statistical significance. Additionally, as noted in response to the first comment, we now include matched-compute single-agent baselines. These changes provide stronger support for the superiority claim.
revision: yes

Circularity Check

0 steps flagged

No circularity in empirical multi-agent evaluation framework

The paper presents ChatEval as an empirical construction: a multi-agent debate setup for LLM-based evaluation on open-ended and NLG tasks. No equations, derivations, or fitted parameters appear in the provided text. Claims rest on experimental comparisons and public code rather than any self-referential reduction where a 'prediction' equals an input by construction. Self-citations are absent from the abstract and setup; the method does not invoke uniqueness theorems or smuggle ansatzes. This is a standard empirical proposal whose validity can be checked externally via replication, yielding a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the domain assumption that LLMs can productively debate evaluation criteria and that the resulting consensus is more reliable than single-model output. No free parameters are explicitly fitted in the abstract; the main invented entity is the ChatEval referee team itself.

axioms (1)

domain assumption: LLMs can effectively debate and reach consensus on text quality
Invoked when the multi-agent framework is introduced as superior to single-agent methods.

invented entities (1)

ChatEval referee team · no independent evidence
purpose: Autonomous multi-agent discussion and evaluation of generated responses
New system constructed in the paper; no independent evidence provided beyond the authors' experiments.

pith-pipeline@v0.9.0 · 5517 in / 1123 out tokens · 31435 ms · 2026-05-13T12:58:17.427214+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 7.0

Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
cs.AI 2026-05 unverdicted novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
AI-Gram: When Visual Agents Interact in a Social Network
cs.AI 2026-04 unverdicted novelty 7.0

Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
cs.CL 2026-04 unverdicted novelty 7.0

OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
cs.CL 2026-04 unverdicted novelty 7.0

Introduces OmniBehavior benchmark from real-world data and shows LLMs exhibit hyper-activity, persona homogenization, and utopian bias in behavior simulation.
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
cs.SE 2026-02 conditional novelty 7.0

SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate
cs.CL 2026-01 unverdicted novelty 7.0

SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
GAIA: a benchmark for General AI Assistants
cs.CL 2023-11 unverdicted novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits
physics.soc-ph 2026-05 accept novelty 6.0

Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-...
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 6.0

Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...
LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews
cs.CL 2026-05 unverdicted novelty 6.0

Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
cs.AI 2026-05 unverdicted novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
Pact: A Choreographic Language for Agentic Ecosystems
cs.PL 2026-05 unverdicted novelty 6.0

Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
cs.HC 2026-04 unverdicted novelty 6.0

MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...
TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems
cs.MA 2026-04 unverdicted novelty 6.0

TeamFusion uses per-member proxy agents and iterative structured discussions to generate more representative and consensual team deliverables than direct aggregation in open-ended tasks.
PARM: Pipeline-Adapted Reward Model
cs.AI 2026-04 unverdicted novelty 6.0

PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
cs.AI 2026-04 unverdicted novelty 6.0

SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
cs.MA 2026-04 unverdicted novelty 6.0

HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
cs.MA 2026-04 unverdicted novelty 6.0

LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
Token-Level LLM Collaboration via FusionRoute
cs.AI 2026-01 unverdicted novelty 6.0

FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and mergi...
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.
TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit
cs.MA 2025-07 accept novelty 6.0

TinyTroupe provides a toolkit for fine-grained persona-based LLM multi-agent simulations with built-in support for population sampling, experimentation, and validation.
Mixture-of-Agents Enhances Large Language Model Capabilities
cs.CL 2024-06 unverdicted novelty 6.0

A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
Cognitive Architectures for Language Agents
cs.AI 2023-09 accept novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...
A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models
cs.LG 2026-05 unverdicted novelty 5.0

DMoA is a differentiable multi-agent LLM framework with recurrent context-aware routing and predictive entropy self-supervision that claims SOTA results on 9 benchmarks through elastic agent collaboration.
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
cs.MA 2026-05 unverdicted novelty 5.0

Agentic AI needs social theory as a structural prior, formalized via the MASS dynamical system framework with four priors: strategic heterogeneity, networked-constrained dependence, co-evolution, and distributional in...
TRUST: A Framework for Decentralized AI Service v.0.1
cs.AI 2026-04 unverdicted novelty 5.0

TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
cs.MA 2026-03 unverdicted novelty 5.0

Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization
cs.CL 2025-11 unverdicted novelty 5.0

MAGIC-HMO is a multi-agent framework that treats Chinese short-form creative NLG as heterogeneous multi-objective optimization over personalized constraints plus explanation reliability and outperforms baselines on a ...
ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction
cs.CL 2025-11 unverdicted novelty 5.0

ZoFia is a zero-shot fake news detection framework that uses hierarchical entity salience retrieval followed by multi-LLM adversarial debate to improve robustness over single-model approaches.
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
cs.CL 2025-08 unverdicted novelty 5.0

GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
cs.CL 2023-05 conditional novelty 5.0

Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
cs.MA 2026-05 unverdicted novelty 4.0

Agentic AI requires social theory as a structural prior in the proposed MASS framework to model emergent outcomes from agent interactions and influence.
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
cs.MA 2026-05 unverdicted novelty 4.0

Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.
BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection
cs.CL 2026-04 unverdicted novelty 4.0

BLUEmed combines hybrid RAG with structured multi-agent debate and a safety filter to detect terminology substitution errors in clinical notes, reaching 69.13% accuracy under few-shot prompting and outperforming singl...
Beyond Retrieval: Modeling Confidence Decay and Deterministic Agentic Platforms in Generative Engine Optimization
cs.AI 2026-04 unverdicted novelty 4.0

Deterministic multi-agent intent routing can reduce hallucinations in generative engines to near zero by limiting LLMs to intent routers and handing off tasks to specialized agents.
EMS: Multi-Agent Voting via Efficient Majority-then-Stopping
cs.AI 2026-04 unverdicted novelty 4.0

EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Bridging Language Models and Financial Analysis
q-fin.ST 2025-03 unverdicted novelty 2.0

A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.
LLM Multi-Agent Systems: Challenges and Open Problems
cs.MA 2024-02 unverdicted novelty 2.0

The paper identifies inadequately addressed challenges in optimizing task allocation, fostering robust reasoning through debates, managing layered context, enhancing memory, and applying multi-agent systems to blockchain.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 43 Pith papers · 9 internal anchors

[1]

Benchmarking foundation models with language-model-as-an-examiner

Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, et al. Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181,

work page arXiv
[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901
[3]

Fast, cheap, and creative: Evaluating translation quality using amazon’s mechanical turk

Chris Callison-Burch. Fast, cheap, and creative: Evaluating translation quality using amazon’s mechanical turk. In Proceedings of the 2009 conference on empirical methods in natural language processing, pp. 286–295,

work page 2009
[5]

arXiv preprint arXiv:2006.14799

URL https://arxiv.org/abs/2006.14799. Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937,

work page arXiv 2006
[6]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023),

work page 2023
[7]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166,

work page internal anchor Pith review arXiv
[10]

Human-like summarization evaluation with chatgpt

Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554,

work page arXiv
[11]

The perils of using mechanical turk to evaluate open-ended text generation

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. The perils of using mechanical turk to evaluate open-ended text generation. arXiv preprint arXiv:2109.06835,

work page arXiv
[12]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760, 2023a. Ruosen Li, Teerth Patel, and Xinya Du. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arX...

work page internal anchor Pith review arXiv
[13]

M., Yang, D., and Vosoughi, S

Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. Training socially aligned language models in simulated human society. arXiv preprint arXiv:2305.16960, 2023a. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment...

work page arXiv
[14]

Roco: Dialectic multi-robot collaboration with large language models, 2023

Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738,

work page arXiv
[15]

Usr: An unsupervised and reference free evaluation metric for dialog generation

Shikib Mehri and Maxine Eskenazi. Usr: An unsupervised and reference free evaluation metric for dialog generation. arXiv preprint arXiv:2005.00456,

work page arXiv 2005
[16]

Why we need new evaluation metrics for NLG

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2241–2252, Copenhagen, Denmark, September

work page 2017
[17]

Orienteering in an Information Landscape: How Information Seekers Get from Here to There

Association for Computational Linguistics. doi: 10.18653/v1/D17-1238. URL https://aclanthology.org/D17-1238. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Inform...

[18]

Generative Agents: Interactive Simulacra of Human Behavior

Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442,

[19]

ChatDev: Communicative Agents for Software Development

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924,

[20]

Multitask Prompted Training Enables Zero-Shot Task Generalization

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207,

[21]

Bleurt: Learning robust metrics for text generation

Thibault Sellam, Dipanjan Das, and Ankur P Parikh. Bleurt: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696,

[22]

Are large language models good evaluators for abstractive summarization?

Chenhui Shen, Liying Cheng, Yang You, and Lidong Bing. Are large language models good evaluators for abstractive summarization? arXiv preprint arXiv:2305.13091, 2023.

[23]

Is ChatGPT a good NLG evaluator? A preliminary study

Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048, 2023a.

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv pre...

[24]

Large language models are diverse role-players for summarization evaluation

Ning Wu, Ming Gong, Linjun Shou, Shining Liang, and Daxin Jiang. Large language models are diverse role-players for summarization evaluation. arXiv preprint arXiv:2303.15078, 2023.

[25]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675,

[26]

Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622,

[27]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,

[28]

Towards a unified multi-dimensional evaluator for text generation

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197,

[29]

A PROMPT TEMPLATE AND DIVERSE ROLE PROMPT

The overall prompt template is shown in Table 6. We draw inspiration from Wu et al. (2023) and design several different role descriptions as follows.

General Public: You are now General Public, one of the referees in this task. You are interested in the story and looking for updates on the investigation. Please thi...


[1]

Benchmarking foundation models with language-model-as-an-examiner

Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, et al. Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181,

[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

[3]

Fast, cheap, and creative: Evaluating translation quality using amazon’s mechanical turk

Chris Callison-Burch. Fast, cheap, and creative: Evaluating translation quality using amazon’s mechanical turk. In Proceedings of the 2009 conference on empirical methods in natural language processing, pp. 286–295,

[5]

Can large language models be an alternative to human evaluations?

URL https://arxiv.org/abs/2006.14799.

Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937,

[6]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023),

[7]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

[8]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325,

[9]

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166,

[10]

Human-like summarization evaluation with chatgpt

Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554,

[11]

The perils of using mechanical turk to evaluate open-ended text generation

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. The perils of using mechanical turk to evaluate open-ended text generation. arXiv preprint arXiv:2109.06835,
