AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning
2 Pith papers cite this work.
abstract
Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm, LLM-as-Judge with a fixed rubric, applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via an LLM, (ii) evaluates agent trajectories step by step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves a Pearson correlation of r = 0.79 with human judgments (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric-generated pairs improve task success by 6.8-8.5% over the best baseline. AdaRubric also generalises zero-shot to unseen domains (SWE-bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics
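To make the abstract's two central ideas concrete, the following is a minimal illustrative sketch of confidence-weighted per-dimension scoring and of a per-dimension floor that stops a strong dimension from masking a weak one. It is not taken from the AdaRubric codebase: the names `StepScore`, `aggregate_by_dimension`, and `passes_dimension_floor`, the 1-to-5 score scale, and the floor value of 2.5 are all assumptions made for illustration, and the paper's actual formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    """One judge score for one trajectory step (hypothetical structure)."""
    dimension: str     # rubric dimension, e.g. "Correctness"
    score: float       # assumed scale: 1 (worst) to 5 (best)
    confidence: float  # judge confidence in [0, 1]


def aggregate_by_dimension(step_scores):
    """Return the confidence-weighted mean score for each dimension."""
    totals = {}
    for s in step_scores:
        weighted_sum, weight = totals.get(s.dimension, (0.0, 0.0))
        totals[s.dimension] = (weighted_sum + s.confidence * s.score,
                               weight + s.confidence)
    # A dimension whose total confidence is zero gets a score of 0.0.
    return {d: (ws / w if w > 0 else 0.0) for d, (ws, w) in totals.items()}


def passes_dimension_floor(dim_scores, floor=2.5):
    """Accept a trajectory only if every dimension reaches the floor."""
    return all(v >= floor for v in dim_scores.values())


# Example: strong on Correctness, weak on Error Handling.
scores = [
    StepScore("Correctness", 5.0, 1.0),
    StepScore("Correctness", 4.0, 0.5),
    StepScore("Error Handling", 1.0, 1.0),
    StepScore("Error Handling", 2.0, 1.0),
]
dims = aggregate_by_dimension(scores)
overall = sum(dims.values()) / len(dims)
accepted = passes_dimension_floor(dims)
```

In this example the overall mean sits above the assumed floor even though Error Handling does not, so a filter that looked only at the overall score would accept the trajectory while the per-dimension check rejects it. This is the kind of masking the abstract attributes to DimensionAwareFilter, shown here only under the stated assumptions.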
years
2026: 2
verdicts
UNVERDICTED: 2
representative citing papers
SCORE is a shared-parameter co-evolutionary framework coupling generation and evaluation of deep research reports with a meta-harness to adapt evaluation standards as performance improves.
citing papers explorer
- SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
SkillCoach introduces self-evolving rubrics derived from rollouts to evaluate and supervise four process dimensions of agentic skill-use separately from outcome success.