From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

Bolin Shen; Huaiyuan Yao; Hua Wei; Wanpeng Xu; Yihan Hong; Yushun Dong

arxiv: 2601.08654 · v2 · pith:LQJIKFIBnew · submitted 2026-01-13 · 💻 cs.CL · cs.AI · cs.LG

keywords: human, rubric, rulers, score, evaluation, scoring, text, models

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem: the goal is not merely to prompt an LLM to assign a score, but to transfer human rubric intent into a stable, auditable, and human-aligned scoring protocol. We identify three recurring failure modes in LLM-based rubric scoring: rubric execution drift, unverifiable score attribution, and human-scale misalignment. To address these failure modes, we introduce Rulers, a three-stage inference-time framework for reliable, evidence-grounded rubric-based text evaluation. Rulers first converts a human rubric into a locked task-level specification, then executes the specification with structured checklist decisions, typed evidence grounding, and extractive quote verification when applicable, and finally applies post-hoc calibration to align model-derived signals with human score boundaries. Across four rubric-governed benchmarks covering essay scoring, summarization assessment, EFL writing evaluation, and structured-input text generation, Rulers achieves stronger human-score agreement in most evaluated settings across multiple frozen backbone models. Further analyses show that Rulers better matches empirical human score distributions, improves stability under semantically equivalent rubric perturbations, and benefits from each of its three components. These results suggest that reliable LLM judging requires fixed criteria, traceable evidence, and calibrated score interpretation rather than prompt phrasing alone. Our code is available at https://anonymous.4open.science/r/Rulers_0525-3328.
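
The abstract names two mechanical steps that a small example can make concrete: extractive quote verification and post-hoc calibration onto human score boundaries. The sketch below is illustrative only and is not the authors' implementation. The names `ChecklistDecision`, `quote_is_extractive`, `raw_signal`, `fit_boundaries`, and `calibrate` are hypothetical, and the calibration shown (cut points chosen so that a small human-scored development set is split in the same proportions as its human scores) is an assumed stand-in, since the abstract does not specify the method.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistDecision:
    item_id: str   # hypothetical checklist item identifier
    passed: bool   # the judge's binary decision for this item
    quote: str     # evidence the judge claims to have copied from the text


def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not break matching."""
    return " ".join(s.split()).lower()


def quote_is_extractive(quote: str, text: str) -> bool:
    """True only if the non-empty quote occurs verbatim in the evaluated text."""
    q = _normalize(quote)
    return bool(q) and q in _normalize(text)


def raw_signal(decisions, text):
    """Fraction of items that passed AND whose evidence quote verifies."""
    if not decisions:
        return 0.0
    supported = sum(
        1 for d in decisions if d.passed and quote_is_extractive(d.quote, text)
    )
    return supported / len(decisions)


def fit_boundaries(signals, human_scores):
    """Assumed calibration: place cut points so the development-set signals are
    partitioned in the same proportions as the human scores."""
    levels = sorted(set(human_scores))
    ordered = sorted(signals)
    cuts, seen = [], 0
    for level in levels[:-1]:
        seen += human_scores.count(level)
        # Cut halfway between the last signal of this band and the first of the next.
        cuts.append((ordered[seen - 1] + ordered[seen]) / 2)
    return levels, cuts


def calibrate(signal, levels, cuts):
    """Map a raw signal to the human score level whose band contains it."""
    return levels[bisect_right(cuts, signal)]


text = "The essay argues that rivers shape cities. It cites two historical examples."
decisions = [
    ChecklistDecision("thesis", True, "rivers shape cities"),
    ChecklistDecision("evidence", True, "cites three examples"),  # not in the text
    ChecklistDecision("counterargument", False, ""),
]
levels, cuts = fit_boundaries([0.1, 0.2, 0.5, 0.6, 0.9], [1, 1, 2, 2, 3])
score = calibrate(raw_signal(decisions, text), levels, cuts)
```

In this toy run the second decision is marked as passed but its quote does not appear in the text, so it contributes nothing to the raw signal. Discounting unverifiable evidence in this way is one possible reading of "traceable evidence"; a real system might instead flag such items for review.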

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
cs.AI 2026-05 unverdicted novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
cs.CL 2026-04 unverdicted novelty 6.0

LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.