LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
MADRAG combines multi-agent debate with retrieval-augmented generation to produce training-free analytic essay scores that outperform prompt baselines and approach supervised systems.
citing papers explorer
- Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).