JudgeSense benchmark shows LLM judge consistency does not reliably improve with model scale, with coherence most sensitive to prompt changes and factuality more stable.
Compare the two responses. Pick A or B
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
JudgeSense benchmark shows LLM judge consistency does not reliably improve with model scale, with coherence most sensitive to prompt changes and factuality more stable.