Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
Pith reviewed 2026-05-18 03:29 UTC · model grok-4.3
The pith
LLM judges assign inconsistent scores to identical outputs across repeated runs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. The inconsistency is quantified across different NLG tasks and benchmarks to determine if judicious use of LLM judges can remain viable with proper guidelines.
What carries the argument
Intra-rater reliability measured by variance in scores assigned by the same LLM judge to identical inputs across independent runs.
If this is right
- Single-run LLM judgments may not support stable comparisons between different models or systems.
- Averaging scores over multiple runs could reduce variance and improve practical usability.
- Inconsistency levels differ across NLG tasks, so some benchmarks may need more caution than others.
- Explicit guidelines for LLM judge usage can help retain value while managing the observed variance.
Where Pith is reading between the lines
- Fixing temperature to zero or using deterministic sampling might narrow but not remove the variance.
- The finding points toward ensemble or fine-tuned judge variants as possible ways to increase consistency.
- Similar self-inconsistency could affect other LLM decision tasks that require repeatable outputs.
Load-bearing premise
That repeated independent runs of the same LLM judge on identical inputs provide a valid measure of intra-rater reliability rather than reflecting prompt sampling noise or temperature effects.
What would settle it
Finding that scores remain identical or nearly identical across many repeated runs under fixed temperature and prompt conditions would contradict the reported low reliability.
Figures
read the original abstract
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates self-inconsistency in LLM-as-a-judge setups for NLG evaluation. It claims that LLM judges exhibit low intra-rater reliability, shown through variance in assigned scores across repeated independent runs on identical inputs, which can render ratings inconsistent or arbitrary. The authors quantify this phenomenon across multiple NLG tasks and benchmarks and discuss conditions under which LLM judges may still be used judiciously.
Significance. If the observed variance reflects genuine instability in the judge's evaluative mapping rather than sampling artifacts, the result would be significant for automated evaluation research, where LLM judges are increasingly adopted for their alignment with human preferences over n-gram or embedding metrics. The cross-task quantification provides useful breadth, though the work would benefit from stronger controls to secure the reliability interpretation.
major comments (2)
- [Abstract] Abstract: the claim that experiments demonstrate low intra-rater reliability rests on observed score variance, but the abstract (and presumably the experimental description) provides no details on the number of runs, statistical tests, effect sizes, or controls for prompt variation. This leaves the central empirical claim without visible supporting evidence and is load-bearing for the paper's contribution.
- [Experimental setup] Experimental setup: the central claim interprets score variance across repeated runs on identical inputs as evidence of intra-rater unreliability. If the runs use temperature > 0 or default stochastic decoding, this variance is the expected outcome of token sampling and does not demonstrate instability in the judge's evaluative process. The manuscript must report results at temperature=0 with fixed seeds to support the reliability interpretation.
minor comments (2)
- [Related Work] Add explicit discussion of prior work on LLM judge consistency and reliability metrics to better situate the contribution.
- Ensure that any tables or figures reporting score distributions include error bars, run counts, and clear statistical comparisons.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript investigating self-inconsistency in LLM-as-a-judge frameworks. We address each major comment below and outline the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that experiments demonstrate low intra-rater reliability rests on observed score variance, but the abstract (and presumably the experimental description) provides no details on the number of runs, statistical tests, effect sizes, or controls for prompt variation. This leaves the central empirical claim without visible supporting evidence and is load-bearing for the paper's contribution.
Authors: We agree that the abstract would benefit from greater specificity to make the empirical support transparent. In the revised version, we will expand the abstract to report the number of repeated runs per input (10), the primary statistical measures (mean, standard deviation, and variance), relevant effect sizes, and explicit confirmation that input prompts and task instructions were held fixed across runs. revision: yes
-
Referee: [Experimental setup] Experimental setup: the central claim interprets score variance across repeated runs on identical inputs as evidence of intra-rater unreliability. If the runs use temperature > 0 or default stochastic decoding, this variance is the expected outcome of token sampling and does not demonstrate instability in the judge's evaluative process. The manuscript must report results at temperature=0 with fixed seeds to support the reliability interpretation.
Authors: We acknowledge the referee's point that stochastic decoding can introduce sampling variance unrelated to evaluative instability. To isolate the judge's scoring behavior, we will add a new set of experiments conducted at temperature=0 with fixed random seeds and report the resulting score distributions and variance statistics alongside the original results. This will allow readers to distinguish sampling effects from any inherent inconsistency in the LLM judge. revision: yes
Circularity Check
No circularity: purely empirical measurement study
full rationale
The paper reports experimental measurements of score variance across repeated LLM judge runs on identical inputs. No derivations, equations, fitted parameters, or self-citations are present in the provided text or abstract. The central claim rests on direct observation of inconsistency rather than any reduction of a 'prediction' or 'result' to its own inputs by construction. This is a standard empirical reliability study whose validity can be assessed against external benchmarks (e.g., temperature=0 controls) without internal circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We ran each judge LLM on the same set of generations independently for three runs... computed intra-rater reliability using Krippendorff’s Alpha.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
LLM judges have low intra-rater reliability in their assigned scores across different runs.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Evaluation of text generation: A survey. Preprint, arXiv:2006.14799. Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. 2025. Judgelrm: Large reasoning models as a judge. Preprint, arXiv:2504.00050. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evalua- tions? InProceedings...
-
[2]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning.Preprint, arXiv:2501.12948. Lance Eliot. 2025. Why doing chain-of-thought prompt- ing in reasoning llms gums up the works. Alexander R. Fabbri, Wojciech Kry´sci´nski, Bryan Mc- Cann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. Summeval: Re-evaluating summariz...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge model is not a general substitute for gpt-4.Preprint, arXiv:2403.02839. T. K. Koo and M. Y . Li. 2016. A guideline of selecting and reporting intraclass correlation coefficients for re- liability research.Journal of Chiropractic Medicine, 15(2):155–163. Erratum in: J Chiropr Med. 20...
-
[4]
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
Llms-as-judges: A comprehensive sur- vey on llm-based evaluation methods.Preprint, arXiv:2412.05579. Chin-Yew Lin. 2004. ROUGE: A package for auto- matic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chen...
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[5]
Intra and inter-rater reliability of screening for movement impairments: Movement control tests from the foundation matrix.Journal of Sports Sci- ence and Medicine, 14(2):427–440. Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. 2024. State of what art? a call for multi-prompt LLM evaluation. Transactions of the Asso...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Black-box uncertainty quantification method for llm-as-a-judge.Preprint, arXiv:2410.11594. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models.Preprint, arXiv:2203.11171. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten B...
-
[7]
Evaluating evaluation metrics: A framework for analyzing NLG evaluation metrics using measure- ment theory. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Process- ing, pages 10967–10982, Singapore. Association for Computational Linguistics. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Ch...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Hallucinations: Information added not in the source
-
[9]
Contradictions: Statements opposing source content
-
[10]
Entity Errors: Incorrect names/roles/locations
-
[11]
Omissions: Key points missing from the summary
-
[12]
Output: A single number0for consistent sum- mary and1for inconsistent summary
Temporal Errors: Wrong se- quence/timeframe of events. Output: A single number0for consistent sum- mary and1for inconsistent summary. Document:{{Full source text}} Summary:{{Generated Summary}} Figure 8: Prompt used for each run in SummaC bench- mark. ForSummEval, there are four different metrics, coherence, consistency, fluency and relevance. For each ru...
work page 1960
-
[13]
Read article and identify key points
-
[14]
Check if summary presents them clearly and logically
-
[15]
Score 1–5. Example: News Article: {{Source Text}} Summary: {{Summary}} Evaluation Form (scores ONLY): Coherence: (a) Coherence Instructions:You will be given one summary written for a news article. Your task is to rate the summary on one metric. Evaluation Criteria: Consistency (1–5)– the summary should not contradict the source; penalize hallucinated fac...
-
[16]
Read article and summary
-
[17]
Identify any factual errors
-
[18]
Score 1–5. Example: News Article: {{Source Text}} Summary: {{Summary}} Evaluation Form (scores ONLY): Consistency: (b) Consistency Instructions:You will be given one summary written for a news article. Your task is to rate the summary on one metric. Evaluation Criteria: Fluency (1–5)– grammar, spelling, punctuation, word choice, and sentence structure. Ev...
-
[19]
Identify language issues affecting readability
-
[20]
Score 1–5. Example: Summary: {{Summary}} Evaluation Form (scores ONLY): Fluency: (c) Fluency Instructions:You will be given one summary written for a news article. Your task is to rate the summary on one metric. Evaluation Criteria: Relevance (1–5)– includes only important information from the source; penalize redundancy. Evaluation Steps:
-
[21]
Read summary and article
-
[22]
Assess coverage of key points
-
[23]
[[A]]" if assistant A is better,
Score 1–5. Example: News Article: {{Source Text}} Summary: {{Summary}} Evaluation Form (scores ONLY): Relevance: (d) Relevance Figure 9: Prompts For Evaluating Generated Summaries From SummEval Using Four Metrics Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user questions. Your evaluation ...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.