Eval-Skill synthesizes reusable domain-level evaluation skills from 100 cases via two-stage exploration-guided evolution and injects them into judge context, improving LLM judges on RewardBench 2 by 13-18%.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling