Compare the two responses. Pick A or B


1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

cs.CL · 2026-04-26 · unverdicted · novelty 7.0 · 2 refs

JudgeSense benchmark shows LLM judge consistency does not reliably improve with model scale, with coherence most sensitive to prompt changes and factuality more stable.

citing papers explorer

Showing 1 of 1 citing paper.

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

cs.CL · 2026-04-26 · unverdicted · none · ref 20 · 2 links
