LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5
representative citing papers
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.
CRAFT is a unified bidirectional counterfactual reasoning framework that improves LLM performance on tabular QA and fact verification tasks over baselines on WikiTQ and TabFact.
Medium personality expression in LLM agents yields the most positive user perceptions in goal-oriented tasks, further improved by trait alignment.
FullCite introduces three strategies for structured inline citation generation in QA and finds LLMs identify relevant documents well but struggle with precise evidence spans on ASQA, BioASQ, and ExpertQA.
citing papers explorer
- Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
- Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.