
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
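
To make the partial-credit idea concrete, here is a minimal sketch of how a weighted, multi-criterion reward could be collapsed into a normalized scalar and then turned into group-relative advantages of the kind GRPO uses. The abstract does not give the paper's exact formulas, so everything here is an assumption for illustration: the `Criterion` class, the function names, the [0, 1] score range, the weighted-average normalization, and the mean/standard-deviation advantage. The per-criterion scores stand in for the output of the frozen LLM judge, which is not modeled.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    name: str
    weight: float  # relative importance of this criterion


def normalized_rubric_reward(criteria, scores):
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1].

    `scores` maps criterion name to the judge's score for one response.
    The result is also in [0, 1], so partially correct answers earn partial credit.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = 0.0
    for c in criteria:
        s = scores[c.name]
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score for {c.name!r} must be in [0, 1]")
        weighted += c.weight * s
    return weighted / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    This is the commonly described GRPO-style advantage (reward minus group
    mean, divided by group standard deviation); the paper may differ in detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rubric with made-up criteria and weights.
rubric = [
    Criterion("states_correct_result", 2.0),
    Criterion("cites_relevant_method", 1.0),
    Criterion("reports_units", 1.0),
]
reward = normalized_rubric_reward(
    rubric,
    {"states_correct_result": 1.0, "cites_relevant_method": 0.5, "reports_units": 0.0},
)
print(reward)  # 0.625
```

A binary outcome reward would score this response 0 or 1; the weighted rubric instead gives it 0.625, and the standardization step ranks it against the other sampled responses for the same prompt.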

fields

cs.AI 1 cs.LG 1

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

PReMISE discovers and audits rubric sets for LLM judges, finding no existing source meets all reliability, preference-fit, and robustness criteria simultaneously while showing two repair methods improve accuracy and reduce exploitability.

citing papers explorer


  • PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges cs.AI · 2026-05-29 · unverdicted · none · ref 1 · internal anchor

    PReMISE discovers and audits rubric sets for LLM judges, finding no existing source meets all reliability, preference-fit, and robustness criteria simultaneously while showing two repair methods improve accuracy and reduce exploitability.

  • RUBAS: Rubric-Based Reinforcement Learning for Agent Safety cs.LG · 2026-06-02 · unverdicted · none · ref 2 · internal anchor

    RUBAS decomposes agent behavior into four rubric dimensions to supply fine-grained RL rewards that improve safety while preserving task utility on agent benchmarks.