LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
6 Pith papers cite this work. Polarity classification is still indexing.
Representative citing papers
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.
CRAFT is a unified bidirectional counterfactual reasoning framework that improves LLM performance on tabular QA and fact verification tasks over baselines on WikiTQ and TabFact.
Medium personality expression in LLM agents yields the most positive user perceptions in goal-oriented tasks, further improved by trait alignment.
FullCite introduces three strategies for structured inline citation generation in QA and finds LLMs identify relevant documents well but struggle with precise evidence spans on ASQA, BioASQ, and ExpertQA.
The authors propose an evaluation framework for LLM-generated structured search summaries and describe plans for implementing and testing it.
citing papers explorer
- CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification
CRAFT is a unified bidirectional counterfactual reasoning framework that improves LLM performance on tabular QA and fact verification tasks over baselines on WikiTQ and TabFact.
- Explicit Evidence Grounding via Structured Inline Citation Generation
FullCite introduces three strategies for structured inline citation generation in QA and finds LLMs identify relevant documents well but struggle with precise evidence spans on ASQA, BioASQ, and ExpertQA.