Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

Bugeun Kim; Chanhee Cho; Hyeonchu Park; Junhyuk Choi; Sohhyung Park

arxiv: 2602.00521 · v2 · pith:7WLALX5Enew · submitted 2026-01-31 · 💻 cs.AI

Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim

classification 💻 cs.AI

keywords llm-as-a-judge, framework, reliability, response, diagnosing, human, item, judges

While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
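For readers unfamiliar with the Graded Response Model named in the abstract, the following is a minimal sketch of the standard (Samejima) GRM category probabilities for one ordinal rating scale. It is not taken from the paper: the function name `grm_category_probs`, the parameter names, and the example values are illustrative assumptions, and the authors' exact parameterization may differ.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Return [P(Y = k | theta) for k = 0..K] under the standard GRM.

    theta          -- latent quality of the item being judged (assumed name)
    discrimination -- positive slope parameter a (assumed name)
    thresholds     -- strictly increasing category boundaries b_1 < ... < b_K
    """
    if discrimination <= 0:
        raise ValueError("discrimination must be positive")
    if any(hi <= lo for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")

    # Cumulative curves P(Y >= k | theta), with P(Y >= 0) = 1 and P(Y >= K+1) = 0.
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-discrimination * (theta - b))))
    cumulative.append(0.0)

    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical example: a 5-point rating scale needs 4 thresholds.
probs = grm_category_probs(theta=0.5, discrimination=1.2,
                           thresholds=[-1.0, 0.0, 1.0, 2.0])
```

In this sketch, a larger discrimination value makes the rating categories separate more sharply along the latent scale, while the thresholds mark where one rating becomes more likely than the one below it.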

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
cs.CL 2026-04 unverdicted novelty 7.0

Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that grading accuracy declines at different rates with response difficulty, with errors clustering on the partially correct label and difficult...

Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
cs.CL 2026-04 unverdicted novelty 7.0

Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.