arXiv preprint arXiv:2602.00521

Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory · 2026 · cs.AI · arXiv 2602.00521

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
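
For readers unfamiliar with the Graded Response Model named in the abstract, the following is a minimal sketch of its standard textbook form: the probability of reaching at least score category k is a logistic function of the latent trait, and each category's probability is the difference between adjacent cumulative curves. The function name, parameter names, and numeric values are illustrative assumptions, not taken from the paper, and the abstract does not say how the paper maps judges, prompts, and items onto these parameters.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Category probabilities under a standard Graded Response Model.

    theta          : latent trait value (hypothetical, e.g. response quality)
    discrimination : slope parameter a > 0
    thresholds     : strictly increasing boundaries b_1 < ... < b_{K-1}
    Returns a list of K probabilities, one per ordered score category.
    """
    if any(hi <= lo for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")

    # Cumulative curves P(score >= k): 1 for the lowest category,
    # a logistic curve per threshold, and 0 beyond the highest category.
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-discrimination * (theta - b))))
    cumulative.append(0.0)

    # Each category probability is the gap between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Illustrative values only: a three-category scale with thresholds at -1 and 1.
probs = grm_category_probs(theta=0.0, discrimination=1.0, thresholds=[-1.0, 1.0])
```

In this formulation, a larger discrimination value makes the category boundaries sharper, and raising theta moves probability mass toward the higher score categories.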

representative citing papers

LLM-Ideoplasticity: Measuring Ideological Plasticity in the Political Behavior of LLMs as a Context-Conditioned Distribution

cs.CY · 2026-05-26 · unverdicted · novelty 7.0

LLM political behavior forms a context-conditioned distribution over political space rather than a fixed point, with measured sensitivities to framing and language but a narrow overall range compared to real parties.

Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

cs.CL · 2026-04-30 · unverdicted · novelty 7.0 · 2 refs

Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.

citing papers explorer

Showing 3 of 3 citing papers.

LLM-Ideoplasticity: Measuring Ideological Plasticity in the Political Behavior of LLMs as a Context-Conditioned Distribution

cs.CY · 2026-05-26 · unverdicted · none · ref 4 · internal anchor

LLM political behavior forms a context-conditioned distribution over political space rather than a fixed point, with measured sensitivities to framing and language but a narrow overall range compared to real parties.

Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

cs.CL · 2026-04-30 · unverdicted · none · ref 27 · 2 links · internal anchor

Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

cs.CL · 2026-05-26 · unverdicted · none · ref 8 · internal anchor

JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.
