Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Thakur, Aman Singh; Choudhary, Kartik; Ramayapally, Venkat Srinik; Vaidyanathan, Sankaran; Hupkes, Dieuwke · 2025

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

cs.CL · 2026-05-13 · accept · novelty 7.0

LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

cs.SE · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.

citing papers explorer

Showing 2 of 2 citing papers.

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
cs.CL · 2026-05-13 · accept · none · ref 29
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
cs.SE · 2026-05-17 · unverdicted · none · ref 28 · 2 links
