Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

Julia Hockenmaier; Rajarshi Haldar

arxiv: 2510.27106 · v1 · submitted 2025-10-31 · 💻 cs.CL

Pith reviewed 2026-05-18 03:29 UTC · model grok-4.3

keywords: LLM judges · intra-rater reliability · NLG evaluation · judgment variance · LLM-as-a-judge · rating inconsistency · evaluation metrics · self-consistency

The pith

LLM judges assign inconsistent scores to identical outputs across repeated runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the growing practice of using large language models to evaluate natural language generation outputs, a method that often aligns with human preferences more closely than conventional n-gram or embedding-based metrics. Experiments reveal that these LLM judges frequently change the scores they give to the exact same text when run multiple times independently. This low consistency can make the ratings appear nearly arbitrary, which hinders any attempt to validate how accurate the judgments really are. The authors measure the extent of the problem across several NLG tasks and benchmarks and ask whether LLM judges can still be useful when applied under proper guidelines.

Core claim

LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. The inconsistency is quantified across different NLG tasks and benchmarks to determine if judicious use of LLM judges can remain viable with proper guidelines.

What carries the argument

Intra-rater reliability measured by variance in scores assigned by the same LLM judge to identical inputs across independent runs.
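
As a minimal sketch of the measurement described above, the snippet below computes the variance of the scores each item receives across repeated runs of one judge. The function name `per_item_variance` and the toy scores are illustrative assumptions, not taken from the paper, which may report a different statistic.

```python
from statistics import mean, pvariance

def per_item_variance(runs):
    """Population variance of each item's scores across runs.

    runs: list of runs, each a list of scores for the same items in the same order.
    """
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("every run must score the same items")
    # zip(*runs) transposes runs-by-items into one tuple of scores per item.
    return [pvariance(item_scores) for item_scores in zip(*runs)]

# Toy data (assumed): three runs of one judge over four items on a 1-5 scale.
runs = [
    [4, 3, 5, 2],
    [4, 2, 5, 3],
    [4, 4, 5, 2],
]
variances = per_item_variance(runs)
mean_variance = mean(variances)
```

A variance of zero for an item means the judge gave it the same score in every run; larger values indicate less stable ratings for that item.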

If this is right

Single-run LLM judgments may not support stable comparisons between different models or systems.
Averaging scores over multiple runs could reduce variance and improve practical usability.
Inconsistency levels differ across NLG tasks, so some benchmarks may need more caution than others.
Explicit guidelines for LLM judge usage can help retain value while managing the observed variance.
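
The averaging idea in the list above can be sketched as follows. This is an illustration under the assumption that repeated runs are independent draws, in which case the standard error of the mean shrinks roughly as one over the square root of the number of runs. The helper name `aggregate_runs` and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def aggregate_runs(scores):
    """Return (mean score, standard error of the mean) for one item's repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return mean(scores), stdev(scores) / sqrt(len(scores))

# Toy data (assumed): one item scored in three independent runs.
avg, sem = aggregate_runs([3, 2, 4])
```

Reporting the standard error alongside the averaged score makes it visible when a difference between two systems is smaller than the judge's own run-to-run noise.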

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fixing temperature to zero or using deterministic sampling might narrow but not remove the variance.
The finding points toward ensemble or fine-tuned judge variants as possible ways to increase consistency.
Similar self-inconsistency could affect other LLM decision tasks that require repeatable outputs.

Load-bearing premise

That repeated independent runs of the same LLM judge on identical inputs provide a valid measure of intra-rater reliability rather than reflecting prompt sampling noise or temperature effects.

What would settle it

Finding that scores remain identical or nearly identical across many repeated runs under fixed temperature and prompt conditions would contradict the reported low reliability.
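
One simple way to run that check is to count how many items receive exactly the same score in every run. The sketch below assumes scores are collected as one list per run; the name `exact_agreement_rate` and the data are illustrative, not from the paper.

```python
def exact_agreement_rate(runs):
    """Fraction of items whose score is identical in every run.

    runs: list of runs, each a list of scores for the same items in the same order.
    """
    n_items = len(runs[0])
    if n_items == 0 or any(len(run) != n_items for run in runs):
        raise ValueError("every run must score the same non-empty set of items")
    # An item is unanimous when all of its scores collapse to one distinct value.
    unanimous = sum(1 for item_scores in zip(*runs) if len(set(item_scores)) == 1)
    return unanimous / n_items

# Toy data (assumed): three runs over four items.
rate = exact_agreement_rate([
    [4, 3, 5, 2],
    [4, 2, 5, 3],
    [4, 4, 5, 2],
])
```

A rate close to 1.0 under fixed decoding settings would point toward a stable judge, while a low rate would be in line with the reported inconsistency.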

Figures

Figures reproduced from arXiv: 2510.27106 by Julia Hockenmaier, Rajarshi Haldar.

**Figure 1.** Annotating a summary of an article in Sum…

**Figure 2.** Annotating a summary from the SummEval benchmark with scores ranging from 1 to 5 on the metrics: coherence, consistency, fluency and relevance. SummEval: This is a summarization benchmark with 1700 examples where judges rate model-generated summaries of source documents on a 1–5 scale across four metrics: coherence (how well the sentences in the summary fit together), consistency (the factual accuracy of …

**Figure 3.** Ranking conversations from MT-bench with…

**Figure 4.** Self-reliability of LLM judges on SummEval.

**Figure 5.** Balanced Accuracies of different LLM judges…

**Figure 6.** Inter-rater reliability within and across both…

**Figure 7.** Inter-rater reliability of LLM judges against…

**Figure 8.** Prompt used for each run in SummaC bench…

**Figure 9.** Prompts For Evaluating Generated Summaries From SummEval Using Four Metrics

**Figure 10.** Prompt for Evaluating Two Generated Responses to Queries Labeled…

**Figure 11.** Prompt for Evaluating Two Generated Responses to Queries Labeled…

**Figure 12.** Distributions of scores assigned by different raters across all metrics in the SummEval dataset.

Original abstract

As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows LLM judges give varying scores on repeated runs for NLG tasks, but the variance likely traces to non-zero temperature sampling rather than genuine evaluative instability.

The main point is that LLM judges produce inconsistent scores when the same input is fed to them multiple times, and the authors measure this across several NLG benchmarks. They treat the spread as evidence that single-run judgments are too shaky to trust for reliable evaluation. This is a straightforward empirical observation that lines up with what a lot of people have seen in practice when they try to use these models as scorers.

The work does a reasonable job of documenting the effect size of the variance and noting that it shows up across different tasks, which gives the finding some breadth. It also tries to offer guidelines for when LLM judges might still be usable if you average runs or add other checks. That part is useful for anyone already relying on these setups in their pipelines.

The soft spot is the missing control for decoding parameters. If the experiments ran with default temperature above zero, the score differences are exactly what you expect from token sampling noise, not proof that the underlying judgment mapping is unstable. The paper would need explicit temperature-zero runs with fixed seeds to separate those two sources. Without that, the central claim about intra-rater unreliability rests on an uncontrolled variable. The methods description in the abstract is thin on run counts and statistical checks, which makes it hard to judge how robust the numbers are.

This paper is for groups that build or use LLM-based evaluation for generated text. Readers who want a practical heads-up on current evaluation weaknesses will get something out of it, though they should treat the reliability conclusion as preliminary. It is worth sending to peer review so the authors can add the deterministic controls and tighten the statistics; the underlying issue matters enough that a revised version could be worth citing once those gaps are closed.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates self-inconsistency in LLM-as-a-judge setups for NLG evaluation. It claims that LLM judges exhibit low intra-rater reliability, shown through variance in assigned scores across repeated independent runs on identical inputs, which can render ratings inconsistent or arbitrary. The authors quantify this phenomenon across multiple NLG tasks and benchmarks and discuss conditions under which LLM judges may still be used judiciously.

Significance. If the observed variance reflects genuine instability in the judge's evaluative mapping rather than sampling artifacts, the result would be significant for automated evaluation research, where LLM judges are increasingly adopted for their alignment with human preferences over n-gram or embedding metrics. The cross-task quantification provides useful breadth, though the work would benefit from stronger controls to secure the reliability interpretation.

major comments (2)

[Abstract] The claim that experiments demonstrate low intra-rater reliability rests on observed score variance, but the abstract (and presumably the experimental description) provides no details on the number of runs, statistical tests, effect sizes, or controls for prompt variation. This leaves the central empirical claim without visible supporting evidence and is load-bearing for the paper's contribution.
[Experimental setup] The central claim interprets score variance across repeated runs on identical inputs as evidence of intra-rater unreliability. If the runs use temperature > 0 or default stochastic decoding, this variance is the expected outcome of token sampling and does not demonstrate instability in the judge's evaluative process. The manuscript must report results at temperature=0 with fixed seeds to support the reliability interpretation.
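
The control requested in the second major comment could be organized along the lines of the sketch below. Everything here is an assumption for illustration: `stub_judge` is a toy stand-in rather than a real model API, and it is deterministic at temperature 0 by construction, which a real judge may or may not be. Only the shape of the comparison (same prompt, greedy decoding with a fixed seed versus sampled decoding with varying seeds) is the point.

```python
import random
from statistics import pvariance

def stub_judge(prompt, temperature, seed):
    """Hypothetical stand-in for a real LLM judge call (not a real API)."""
    if temperature == 0:
        return 3  # toy assumption: greedy decoding always yields the same score
    return random.Random(seed).choice([2, 3, 4])  # toy sampling noise

def score_spread(judge, prompt, temperature, seeds):
    """Score one prompt once per seed and return the scores and their variance."""
    scores = [judge(prompt, temperature, seed) for seed in seeds]
    return scores, pvariance(scores)

prompt = "Rate the coherence of this summary from 1 to 5."
# Condition 1: temperature 0 with one fixed seed, repeated five times.
greedy_scores, greedy_var = score_spread(stub_judge, prompt, 0, [0, 0, 0, 0, 0])
# Condition 2: temperature 1.0 with a different seed per run.
sampled_scores, sampled_var = score_spread(stub_judge, prompt, 1.0, [1, 2, 3, 4, 5])
```

With a real judge in place of the stub, any variance that survives in the first condition could not be attributed to token sampling alone.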

minor comments (2)

[Related Work] Add explicit discussion of prior work on LLM judge consistency and reliability metrics to better situate the contribution.
Ensure that any tables or figures reporting score distributions include error bars, run counts, and clear statistical comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript investigating self-inconsistency in LLM-as-a-judge frameworks. We address each major comment below and outline the revisions we will make to strengthen the paper.

Referee: [Abstract] The claim that experiments demonstrate low intra-rater reliability rests on observed score variance, but the abstract (and presumably the experimental description) provides no details on the number of runs, statistical tests, effect sizes, or controls for prompt variation. This leaves the central empirical claim without visible supporting evidence and is load-bearing for the paper's contribution.

Authors: We agree that the abstract would benefit from greater specificity to make the empirical support transparent. In the revised version, we will expand the abstract to report the number of repeated runs per input (three), the primary statistical measures (mean, standard deviation, and variance), relevant effect sizes, and explicit confirmation that input prompts and task instructions were held fixed across runs.

Revision: yes

Referee: [Experimental setup] The central claim interprets score variance across repeated runs on identical inputs as evidence of intra-rater unreliability. If the runs use temperature > 0 or default stochastic decoding, this variance is the expected outcome of token sampling and does not demonstrate instability in the judge's evaluative process. The manuscript must report results at temperature=0 with fixed seeds to support the reliability interpretation.

Authors: We acknowledge the referee's point that stochastic decoding can introduce sampling variance unrelated to evaluative instability. To isolate the judge's scoring behavior, we will add a new set of experiments conducted at temperature=0 with fixed random seeds and report the resulting score distributions and variance statistics alongside the original results. This will allow readers to distinguish sampling effects from any inherent inconsistency in the LLM judge.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

The paper reports experimental measurements of score variance across repeated LLM judge runs on identical inputs. No derivations, equations, fitted parameters, or self-citations are present in the provided text or abstract. The central claim rests on direct observation of inconsistency rather than any reduction of a 'prediction' or 'result' to its own inputs by construction. This is a standard empirical reliability study whose validity can be assessed against external benchmarks (e.g., temperature=0 controls) without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work appears to rest on standard empirical assumptions about benchmark validity and run independence.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "We ran each judge LLM on the same set of generations independently for three runs... computed intra-rater reliability using Krippendorff’s Alpha."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: "LLM judges have low intra-rater reliability in their assigned scores across different runs."
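
The first quoted passage mentions Krippendorff's Alpha as the reliability statistic. For readers unfamiliar with it, here is a minimal sketch for complete data (every run scores every item) using the interval distance, i.e. squared score differences. The choice of the interval metric, the function name `krippendorff_alpha_interval`, and the toy scores are assumptions; the source does not say which distance metric the paper uses.

```python
from itertools import combinations

def krippendorff_alpha_interval(runs):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    runs: list of runs, each a list of numeric scores for the same items,
    with no missing values. Each run plays the role of one rater.
    """
    m = len(runs)  # number of ratings per item
    if m < 2 or not runs[0] or any(len(r) != len(runs[0]) for r in runs):
        raise ValueError("need at least two equally sized, non-empty runs")
    units = list(zip(*runs))  # one tuple of scores per item
    values = [v for unit in units for v in unit]
    n = len(values)  # total number of pairable scores

    # Observed disagreement: squared differences within each item.
    within = sum((a - b) ** 2 for unit in units for a, b in combinations(unit, 2))
    d_o = 2 * within / (n * (m - 1))

    # Expected disagreement: squared differences across all pooled scores.
    pooled = sum((a - b) ** 2 for a, b in combinations(values, 2))
    d_e = 2 * pooled / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1 - d_o / d_e

# Toy data (assumed): three runs of one judge over four items on a 1-5 scale.
alpha = krippendorff_alpha_interval([
    [4, 3, 5, 2],
    [4, 2, 5, 3],
    [4, 4, 5, 2],
])
```

Alpha equals 1 when every run agrees exactly on every item and falls toward 0 as within-item disagreement approaches what would be expected from shuffling all scores across items.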

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

