pith. sign in

arxiv: 2510.27106 · v1 · submitted 2025-10-31 · 💻 cs.CL

Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

Pith reviewed 2026-05-18 03:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM judgesintra-rater reliabilityNLG evaluationjudgment varianceLLM-as-a-judgerating inconsistencyevaluation metricsself-consistency
0
0 comments X

The pith

LLM judges assign inconsistent scores to identical outputs across repeated runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the growing practice of using large language models to evaluate natural language generation outputs, a method that often matches human preferences better than older metrics. Experiments reveal that these LLM judges frequently change the scores they give to the exact same text when run multiple times independently. This low consistency can make the ratings appear nearly arbitrary, which hinders any attempt to validate how accurate the judgments really are. The authors measure the extent of this problem across several NLG tasks and benchmarks while checking whether guidelines can still allow useful application of LLM judges.

Core claim

LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. The inconsistency is quantified across different NLG tasks and benchmarks to determine if judicious use of LLM judges can remain viable with proper guidelines.

What carries the argument

Intra-rater reliability measured by variance in scores assigned by the same LLM judge to identical inputs across independent runs.

If this is right

  • Single-run LLM judgments may not support stable comparisons between different models or systems.
  • Averaging scores over multiple runs could reduce variance and improve practical usability.
  • Inconsistency levels differ across NLG tasks, so some benchmarks may need more caution than others.
  • Explicit guidelines for LLM judge usage can help retain value while managing the observed variance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fixing temperature to zero or using deterministic sampling might narrow but not remove the variance.
  • The finding points toward ensemble or fine-tuned judge variants as possible ways to increase consistency.
  • Similar self-inconsistency could affect other LLM decision tasks that require repeatable outputs.

Load-bearing premise

That repeated independent runs of the same LLM judge on identical inputs provide a valid measure of intra-rater reliability rather than reflecting prompt sampling noise or temperature effects.

What would settle it

Finding that scores remain identical or nearly identical across many repeated runs under fixed temperature and prompt conditions would contradict the reported low reliability.

Figures

Figures reproduced from arXiv: 2510.27106 by Julia Hockenmaier, Rajarshi Haldar.

Figure 2
Figure 2. Figure 2: Annotating a summary from the SummEval benchmark with scores ranging from 1 to 5 on the met￾rics: coherence, consistency, fluency and relevance. SummEval This is a summarization benchmark with 1700 examples where judges rate model￾generated summaries of source documents on a 1–5 scale across four metrics: coherence (how well the sentences in the summary fit together), consis￾tency (the factual accuracy of … view at source ↗
Figure 1
Figure 1. Figure 1: Annotating a summary of an article in Sum [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Ranking conversations from MT-bench with [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Self-reliability of LLM judges on SummEval. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Balanced Accuracies of different LLM judges [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Inter-rater reliability of LLM judges against [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: Inter-rater reliability within and across both [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt used for each run in SummaC bench [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompts For Evaluating Generated Summaries From SummEval Using Four Metrics [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt for Evaluating Two Generated Responses to Queries Labeled [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt for Evaluating Two Generated Responses to Queries Labeled [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Distributions of scores assigned by different raters across all metrics in the SummEval dataset. [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
read the original abstract

As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates self-inconsistency in LLM-as-a-judge setups for NLG evaluation. It claims that LLM judges exhibit low intra-rater reliability, shown through variance in assigned scores across repeated independent runs on identical inputs, which can render ratings inconsistent or arbitrary. The authors quantify this phenomenon across multiple NLG tasks and benchmarks and discuss conditions under which LLM judges may still be used judiciously.

Significance. If the observed variance reflects genuine instability in the judge's evaluative mapping rather than sampling artifacts, the result would be significant for automated evaluation research, where LLM judges are increasingly adopted for their alignment with human preferences over n-gram or embedding metrics. The cross-task quantification provides useful breadth, though the work would benefit from stronger controls to secure the reliability interpretation.

major comments (2)
  1. [Abstract] Abstract: the claim that experiments demonstrate low intra-rater reliability rests on observed score variance, but the abstract (and presumably the experimental description) provides no details on the number of runs, statistical tests, effect sizes, or controls for prompt variation. This leaves the central empirical claim without visible supporting evidence and is load-bearing for the paper's contribution.
  2. [Experimental setup] Experimental setup: the central claim interprets score variance across repeated runs on identical inputs as evidence of intra-rater unreliability. If the runs use temperature > 0 or default stochastic decoding, this variance is the expected outcome of token sampling and does not demonstrate instability in the judge's evaluative process. The manuscript must report results at temperature=0 with fixed seeds to support the reliability interpretation.
minor comments (2)
  1. [Related Work] Add explicit discussion of prior work on LLM judge consistency and reliability metrics to better situate the contribution.
  2. Ensure that any tables or figures reporting score distributions include error bars, run counts, and clear statistical comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript investigating self-inconsistency in LLM-as-a-judge frameworks. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that experiments demonstrate low intra-rater reliability rests on observed score variance, but the abstract (and presumably the experimental description) provides no details on the number of runs, statistical tests, effect sizes, or controls for prompt variation. This leaves the central empirical claim without visible supporting evidence and is load-bearing for the paper's contribution.

    Authors: We agree that the abstract would benefit from greater specificity to make the empirical support transparent. In the revised version, we will expand the abstract to report the number of repeated runs per input (10), the primary statistical measures (mean, standard deviation, and variance), relevant effect sizes, and explicit confirmation that input prompts and task instructions were held fixed across runs. revision: yes

  2. Referee: [Experimental setup] Experimental setup: the central claim interprets score variance across repeated runs on identical inputs as evidence of intra-rater unreliability. If the runs use temperature > 0 or default stochastic decoding, this variance is the expected outcome of token sampling and does not demonstrate instability in the judge's evaluative process. The manuscript must report results at temperature=0 with fixed seeds to support the reliability interpretation.

    Authors: We acknowledge the referee's point that stochastic decoding can introduce sampling variance unrelated to evaluative instability. To isolate the judge's scoring behavior, we will add a new set of experiments conducted at temperature=0 with fixed random seeds and report the resulting score distributions and variance statistics alongside the original results. This will allow readers to distinguish sampling effects from any inherent inconsistency in the LLM judge. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper reports experimental measurements of score variance across repeated LLM judge runs on identical inputs. No derivations, equations, fitted parameters, or self-citations are present in the provided text or abstract. The central claim rests on direct observation of inconsistency rather than any reduction of a 'prediction' or 'result' to its own inputs by construction. This is a standard empirical reliability study whose validity can be assessed against external benchmarks (e.g., temperature=0 controls) without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work appears to rest on standard empirical assumptions about benchmark validity and run independence.

pith-pipeline@v0.9.0 · 5639 in / 983 out tokens · 25606 ms · 2026-05-18T03:29:47.183680+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

  1. [1]

    Preprint, arXiv:2006.14799

    Evaluation of text generation: A survey. Preprint, arXiv:2006.14799. Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. 2025. Judgelrm: Large reasoning models as a judge. Preprint, arXiv:2504.00050. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evalua- tions? InProceedings...

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning.Preprint, arXiv:2501.12948. Lance Eliot. 2025. Why doing chain-of-thought prompt- ing in reasoning llms gums up the works. Alexander R. Fabbri, Wojciech Kry´sci´nski, Bryan Mc- Cann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. Summeval: Re-evaluating summariz...

  3. [3]

    An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge model is not a general substitute for gpt-4.Preprint, arXiv:2403.02839. T. K. Koo and M. Y . Li. 2016. A guideline of selecting and reporting intraclass correlation coefficients for re- liability research.Journal of Chiropractic Medicine, 15(2):155–163. Erratum in: J Chiropr Med. 20...

  4. [4]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Llms-as-judges: A comprehensive sur- vey on llm-based evaluation methods.Preprint, arXiv:2412.05579. Chin-Yew Lin. 2004. ROUGE: A package for auto- matic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chen...

  5. [5]

    GPT-4 Technical Report

    Intra and inter-rater reliability of screening for movement impairments: Movement control tests from the foundation matrix.Journal of Sports Sci- ence and Medicine, 14(2):427–440. Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. 2024. State of what art? a call for multi-prompt LLM evaluation. Transactions of the Asso...

  6. [6]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

    Black-box uncertainty quantification method for llm-as-a-judge.Preprint, arXiv:2410.11594. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models.Preprint, arXiv:2203.11171. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten B...

  7. [7]

    Qwen3 Technical Report

    Evaluating evaluation metrics: A framework for analyzing NLG evaluation metrics using measure- ment theory. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 10967–10982, Singapore. Association for Computational Linguistics. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Ch...

  8. [8]

    Hallucinations: Information added not in the source

  9. [9]

    Contradictions: Statements opposing source content

  10. [10]

    Entity Errors: Incorrect names/roles/locations

  11. [11]

    Omissions: Key points missing from the summary

  12. [12]

    Output: A single number0for consistent sum- mary and1for inconsistent summary

    Temporal Errors: Wrong se- quence/timeframe of events. Output: A single number0for consistent sum- mary and1for inconsistent summary. Document:{{Full source text}} Summary:{{Generated Summary}} Figure 8: Prompt used for each run in SummaC bench- mark. ForSummEval, there are four different metrics, coherence, consistency, fluency and relevance. For each ru...

  13. [13]

    Read article and identify key points

  14. [14]

    Check if summary presents them clearly and logically

  15. [15]

    Score 1–5. Example: News Article: {{Source Text}} Summary: {{Summary}} Evaluation Form (scores ONLY): Coherence: (a) Coherence Instructions:You will be given one summary written for a news article. Your task is to rate the summary on one metric. Evaluation Criteria: Consistency (1–5)– the summary should not contradict the source; penalize hallucinated fac...

  16. [16]

    Read article and summary

  17. [17]

    Identify any factual errors

  18. [18]

    Score 1–5. Example: News Article: {{Source Text}} Summary: {{Summary}} Evaluation Form (scores ONLY): Consistency: (b) Consistency Instructions:You will be given one summary written for a news article. Your task is to rate the summary on one metric. Evaluation Criteria: Fluency (1–5)– grammar, spelling, punctuation, word choice, and sentence structure. Ev...

  19. [19]

    Identify language issues affecting readability

  20. [20]

    Example: Summary: {{Summary}} Evaluation Form (scores ONLY): Fluency: (c) Fluency Instructions:You will be given one summary written for a news article

    Score 1–5. Example: Summary: {{Summary}} Evaluation Form (scores ONLY): Fluency: (c) Fluency Instructions:You will be given one summary written for a news article. Your task is to rate the summary on one metric. Evaluation Criteria: Relevance (1–5)– includes only important information from the source; penalize redundancy. Evaluation Steps:

  21. [21]

    Read summary and article

  22. [22]

    Assess coverage of key points

  23. [23]

    [[A]]" if assistant A is better,

    Score 1–5. Example: News Article: {{Source Text}} Summary: {{Summary}} Evaluation Form (scores ONLY): Relevance: (d) Relevance Figure 9: Prompts For Evaluating Generated Summaries From SummEval Using Four Metrics Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user questions. Your evaluation ...