pith. machine review for the scientific record.

arxiv: 2308.07201 · v1 · submitted 2023-08-14 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Pith reviewed 2026-05-13 12:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords: multi-agent debate · LLM evaluation · text assessment · NLG tasks · automated evaluation · ChatEval · human-mimicking evaluation · open-ended questions

The pith

A multi-agent team of LLMs debates to evaluate generated text with human-like reliability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that single large language models fall short of human evaluators when scoring the quality of text outputs. It introduces a multi-agent debate framework in which several LLMs discuss and critique responses to open-ended questions and standard NLG tasks. The approach draws on the practice of human evaluation panels that use multiple annotators for better consensus. If the method works, automated evaluation becomes more consistent and less reliant on expensive human labor while handling nuanced judgments. Experiments show the resulting system, ChatEval, produces assessments that better align with human standards than isolated model scoring.

Core claim

ChatEval constructs a multi-agent referee team that allows a group of LLMs to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation tasks, transcending mere textual scoring to offer a human-mimicking evaluation process for reliable assessments.

What carries the argument

The multi-agent referee team that lets distinct LLMs exchange views and reach consensus on response quality.

If this is right

  • Evaluations on open-ended questions become more reliable without added human annotators.
  • Standard NLG tasks receive assessments that capture subtleties single models often miss.
  • The framework scales to intricate tasks by combining multiple models' strengths.
  • Labor and time costs for large-scale text evaluation drop while consistency rises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The debate logs could serve as richer training signals for improving the evaluated models themselves.
  • Similar multi-agent structures might transfer to other LLM workflows such as planning or verification.
  • Optimal team composition and discussion length remain open parameters that future runs could tune.

Load-bearing premise

Performance gains come from genuine collaboration among the agents rather than from simply making more calls to one model or using better single-prompt instructions.

What would settle it

A controlled test in which one LLM receives the full transcript of the multi-agent discussion and produces evaluation scores that match or exceed the team's accuracy against human judgments.

Original abstract

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ChatEval, a multi-agent debate framework in which multiple LLMs assume distinct referee roles, discuss, and collectively evaluate the quality of responses generated by other models on open-ended questions and standard NLG tasks. It claims that this collaborative process yields more reliable, human-mimicking assessments than conventional single-agent LLM prompting, with supporting experiments and publicly released code.

Significance. If the reported gains can be shown to arise specifically from agent interaction and role differentiation rather than from increased total inference budget, the work would offer a practical, scalable method for automated evaluation that reduces dependence on human annotators while preserving reliability. The public code release is a clear strength for reproducibility and follow-up research.

major comments (2)
  1. [Section 4] Experimental setup (Section 4): No control condition equates total LLM calls or token budget between ChatEval and single-agent baselines. A single-agent variant that issues the same number of sequential or repeated calls (with concatenated history) is required to isolate whether improvements stem from multi-agent debate structure rather than simply aggregating more model outputs.
  2. [Section 4] Results and analysis: The abstract asserts that ChatEval provides superior human-mimicking evaluation, yet the description supplies no concrete quantitative metrics (e.g., Pearson/Spearman correlation with human judgments, win rates, or statistical significance tests) or explicit single-agent baselines with matched compute. This leaves the central superiority claim only moderately supported.
minor comments (2)
  1. [Section 3] The roles and interaction protocol of the referee team are described at a high level; a concise diagram or pseudocode of one debate round would improve clarity of the multi-agent mechanism.
  2. [Section 2] Related-work discussion could more explicitly contrast ChatEval with prior multi-agent LLM frameworks (e.g., those using debate for reasoning) to highlight the novelty of the evaluation-specific application.
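
The matched-budget control requested in major comment 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `stub_llm`, `CallCounter`, `debate`, and `matched_single_agent` are hypothetical names, the stub returns placeholder text instead of calling a model, and the role list is illustrative. The sketch only shows how the number of calls can be equated between a role-based debate and a single-persona baseline that sees the same kind of concatenated history.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLMFn = Callable[[str], str]


class CallCounter:
    """Wraps an LLM function and counts how often it is invoked."""

    def __init__(self, fn: LLMFn) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self.fn(prompt)


def debate(llm: LLMFn, roles: List[str], task: str, rounds: int) -> List[str]:
    """Each role speaks once per round and sees the shared history so far."""
    history: List[str] = []
    for _ in range(rounds):
        for role in roles:
            prompt = f"You are {role}.\nTask: {task}\nHistory:\n" + "\n".join(history)
            history.append(f"{role}: {llm(prompt)}")
    return history


def matched_single_agent(llm: LLMFn, task: str, total_calls: int) -> List[str]:
    """One persona, the same number of sequential calls, concatenated history."""
    history: List[str] = []
    for _ in range(total_calls):
        prompt = f"Task: {task}\nHistory:\n" + "\n".join(history)
        history.append(f"Evaluator: {llm(prompt)}")
    return history


def stub_llm(prompt: str) -> str:
    # Placeholder reply; a real experiment would call a model here.
    return f"reply to a {len(prompt)}-character prompt"


roles = ["General Public", "Critic", "Scientist"]  # illustrative role names
rounds = 2
task = "Which response is better?"

multi = CallCounter(stub_llm)
single = CallCounter(stub_llm)
team_log = debate(multi, roles, task, rounds)
solo_log = matched_single_agent(single, task, total_calls=len(roles) * rounds)

print(multi.calls, single.calls)
```

Equal call counts do not guarantee equal token budgets, because the role instructions lengthen each debate prompt. A stricter control would also log prompt and completion tokens per call.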

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our experimental analysis. We address each major comment below.

Point-by-point responses
  1. Referee: [Section 4] Experimental setup (Section 4): No control condition equates total LLM calls or token budget between ChatEval and single-agent baselines. A single-agent variant that issues the same number of sequential or repeated calls (with concatenated history) is required to isolate whether improvements stem from multi-agent debate structure rather than simply aggregating more model outputs.

    Authors: We acknowledge the importance of controlling for the total computational budget to ensure that the observed improvements are due to the multi-agent debate mechanism rather than increased inference calls. In the revised manuscript, we introduce a new single-agent baseline that performs an equivalent number of sequential LLM calls with concatenated history. Our updated experiments demonstrate that ChatEval maintains superior performance even under this matched-budget condition, thereby strengthening the evidence for the benefits of multi-agent interaction. revision: yes

  2. Referee: [Section 4] Results and analysis: The abstract asserts that ChatEval provides superior human-mimicking evaluation, yet the description supplies no concrete quantitative metrics (e.g., Pearson/Spearman correlation with human judgments, win rates, or statistical significance tests) or explicit single-agent baselines with matched compute. This leaves the central superiority claim only moderately supported.

    Authors: We appreciate this feedback on the presentation of results. While the full manuscript in Section 4 does include quantitative comparisons and human correlation metrics, we have revised the abstract to explicitly state key quantitative findings, including Pearson and Spearman correlations with human judgments, win rates against baselines, and statistical significance. Additionally, as noted in response to the first comment, we now include matched-compute single-agent baselines. These changes provide stronger support for the superiority claim. revision: yes
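
The agreement metrics named in the second exchange can be illustrated with a short, dependency-free sketch. It assumes paired per-item scores from human annotators and from an automatic evaluator; the two score lists are made-up illustration data, not results from the paper, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (norm_x * norm_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up illustration data: one score per evaluated response.
human_scores = [1, 2, 3, 4, 5]
evaluator_scores = [2, 1, 4, 3, 5]

print(pearson(human_scores, evaluator_scores))
print(spearman(human_scores, evaluator_scores))
```

Spearman depends only on rank order, so it suits ordinal ratings. Pearson also reflects how linearly the two sets of scores are related.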

Circularity Check

0 steps flagged

No circularity in empirical multi-agent evaluation framework

The paper presents ChatEval as an empirical construction: a multi-agent debate setup for LLM-based evaluation on open-ended and NLG tasks. No equations, derivations, or fitted parameters appear in the provided text. Claims rest on experimental comparisons and public code rather than any self-referential reduction where a 'prediction' equals an input by construction. Self-citations are absent from the abstract and setup; the method does not invoke uniqueness theorems or smuggle ansatzes. This is a standard empirical proposal whose validity can be checked externally via replication, yielding a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can productively debate evaluation criteria and that the resulting consensus is more reliable than single-model output. No free parameters are explicitly fitted in the abstract; the main invented entity is the ChatEval referee team itself.

axioms (1)
  • domain assumption: LLMs can effectively debate and reach consensus on text quality
    Invoked when the multi-agent framework is introduced as superior to single-agent methods.
invented entities (1)
  • ChatEval referee team (no independent evidence)
    purpose: Autonomous multi-agent discussion and evaluation of generated responses
    New system constructed in the paper; no independent evidence provided beyond the authors' experiments.

pith-pipeline@v0.9.0 · 5517 in / 1123 out tokens · 31435 ms · 2026-05-13T12:58:17.427214+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel (relation: unclear)

    Paper passage: "Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  2. Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

    cs.MA 2026-05 unverdicted novelty 7.0

    Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...

  3. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  4. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

  5. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  6. AI-Gram: When Visual Agents Interact in a Social Network

    cs.AI 2026-04 unverdicted novelty 7.0

    Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.

  7. Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.

  8. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  9. Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

    cs.MA 2026-05 unverdicted novelty 6.0

    Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...

  10. LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

  11. When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews

    cs.CL 2026-05 unverdicted novelty 6.0

    Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.

  12. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  13. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

  14. Pact: A Choreographic Language for Agentic Ecosystems

    cs.PL 2026-05 unverdicted novelty 6.0

    Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.

  15. MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

    cs.HC 2026-04 unverdicted novelty 6.0

    MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...

  16. TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    TeamFusion uses per-member proxy agents and iterative structured discussions to generate more representative and consensual team deliverables than direct aggregation in open-ended tasks.

  17. PARM: Pipeline-Adapted Reward Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.

  18. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  19. Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

    cs.MA 2026-04 unverdicted novelty 6.0

    HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

  20. Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.

  21. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  22. Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    Agentic AI needs social theory as a structural prior, formalized via the MASS dynamical system framework with four priors: strategic heterogeneity, networked-constrained dependence, co-evolution, and distributional in...

  23. TRUST: A Framework for Decentralized AI Service v.0.1

    cs.AI 2026-04 unverdicted novelty 5.0

    TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...

  24. Emergent Social Intelligence Risks in Generative Multi-Agent Systems

    cs.MA 2026-03 unverdicted novelty 5.0

    Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.

  25. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    cs.CL 2023-05 conditional novelty 5.0

    Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.

  26. Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

    cs.MA 2026-05 unverdicted novelty 4.0

    Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.

  27. Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

    cs.MA 2026-05 unverdicted novelty 4.0

    Agentic AI requires social theory as a structural prior in the proposed MASS framework to model emergent outcomes from agent interactions and influence.

  28. BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

    cs.CL 2026-04 unverdicted novelty 4.0

    BLUEmed combines hybrid RAG with structured multi-agent debate and a safety filter to detect terminology substitution errors in clinical notes, reaching 69.13% accuracy under few-shot prompting and outperforming singl...

  29. Beyond Retrieval: Modeling Confidence Decay and Deterministic Agentic Platforms in Generative Engine Optimization

    cs.AI 2026-04 unverdicted novelty 4.0

    Deterministic multi-agent intent routing can reduce hallucinations in generative engines to near zero by limiting LLMs to intent routers and handing off tasks to specialized agents.

  30. EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

    cs.AI 2026-04 unverdicted novelty 4.0

    EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.

  31. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  32. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 28 Pith papers · 8 internal anchors

  1. [1]

    Benchmarking foundation models with language-model-as-an-examiner

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, et al. Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181,

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  3. [3]

    Fast, cheap, and creative: Evaluating translation quality using amazon’s mechanical turk

    Chris Callison-Burch. Fast, cheap, and creative: Evaluating translation quality using amazon’s mechanical turk. In Proceedings of the 2009 conference on empirical methods in natural language processing, pp. 286–295,

  4. [5]

    Can large language models be an alternative to human evaluations?

    Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937,

  5. [6]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023),

  6. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

  7. [8]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325,

  8. [9]

    Gptscore: Evaluate as you desire

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166,

  9. [10]

    Human-like summarization evaluation with chatgpt

    Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554,

  10. [11]

    The perils of using mechanical turk to evaluate open-ended text generation

    Marzena Karpinska, Nader Akoury, and Mohit Iyyer. The perils of using mechanical turk to evaluate open-ended text generation. arXiv preprint arXiv:2109.06835,

  11. [12]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760, 2023a. Ruosen Li, Teerth Patel, and Xinya Du. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arX...

  12. [13]

    Training socially aligned language models in simulated human society

    Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. Training socially aligned language models in simulated human society. arXiv preprint arXiv:2305.16960, 2023a. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment...

  13. [14]

    Roco: Dialectic multi-robot collaboration with large language models

    Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738,

  14. [15]

    Usr: An unsupervised and reference free evaluation metric for dialog generation

    Shikib Mehri and Maxine Eskenazi. Usr: An unsupervised and reference free evaluation metric for dialog generation. arXiv preprint arXiv:2005.00456,

  15. [16]

    Why we need new evaluation metrics for NLG

    Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2241–2252, Copenhagen, Denmark, September

  16. [17]

    Why We Need New Evaluation Metrics for NLG

    Association for Computational Linguistics. doi: 10.18653/v1/D17-1238. URL https://aclanthology.org/D17-1238. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Inform...

  17. [18]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442,

  18. [19]

    ChatDev: Communicative Agents for Software Development

    Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924,

  19. [20]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207,

  20. [21]

    Bleurt: Learning robust metrics for text generation

    Thibault Sellam, Dipanjan Das, and Ankur P Parikh. Bleurt: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696,

  21. [22]

    Are large language models good evaluators for abstractive summarization?

    Chenhui Shen, Liying Cheng, Yang You, and Lidong Bing. Are large language models good evaluators for abstractive summarization? arXiv preprint arXiv:2305.13091,

  22. [23]

    Is chatgpt a good nlg evaluator? a preliminary study

    Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048, 2023a. Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv pre...

  23. [24]

    Large language models are diverse role-players for summarization evaluation

    Ning Wu, Ming Gong, Linjun Shou, Shining Liang, and Daxin Jiang. Large language models are diverse role-players for summarization evaluation. arXiv preprint arXiv:2303.15078,

  24. [25]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675,

  25. [26]

    Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance

    Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909.02622,

  26. [27]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,

  27. [28]

    Towards a unified multi-dimensional evaluator for text generation

    Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197,

  28. [29]

    (2023) and design several different role descriptions as follows

    A PROMPT TEMPLATE AND DIVERSE ROLE PROMPT The overall prompt template is shown in Table 6, we draw inspiration from Wu et al. (2023) and design several different role descriptions as follows. General Public You are now General Public, one of the referees in this task. You are interested in the story and looking for updates on the investigation. Please thi...