Reinforcement learning with rubric anchors

URL https://api · 2025 · arXiv 2508.12790

Canonical reference. 28 Pith papers cite this work; 86% of the classified citations (6 of 7) cite it as background.

citation-role summary

background: 6 · dataset: 1

citation-polarity summary

background: 6 · use dataset: 1

representative citing papers

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

CHERRL is a new controllable testbed for reproducing, analyzing, and detecting reward hacking in rubric-based RL by injecting known biases into LLM-as-a-Judge systems.

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

cs.AI · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.

Rubric-based On-policy Distillation

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

cs.LG · 2026-03-04 · unverdicted · novelty 7.0

ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

cs.AI · 2026-06-19 · unverdicted · novelty 6.0

ARCO introduces a co-evolving rubric model with generation and scoring heads plus a trajectory decomposition constraint that improves exact-match scores on multi-hop QA tasks over outcome, rubric, and process reward baselines.

Deep Research as Rubric for Reinforcement Learning

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

DR-rubric is a two-stage framework using iterative agentic search to generate atomic verifiable constraints for GRPO-based RL, achieving competitive performance on 6 benchmarks with 1K-3K examples via bootstrap or frontier-model rubrics.

Prompt-Level Reward Specifications for Open-Ended Post-Training

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

A prompt-level reward specification framework constructs reusable rubrics and executable checkers from prompts alone to deliver hybrid rewards combining requirement satisfaction, holistic quality, and deterministic constraints for LLM post-training.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.

SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage points across equity sectors.

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on a pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instruction following.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

C2 synthesizes contrastive helpful/misleading rubric pairs from binary preferences to train cooperative generators and critical verifiers, yielding up to 6.5-point gains on RM-Bench and enabling smaller models to match larger rubric-augmented ones.

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 5.0

SAW uses coefficient of variation to dynamically reweight objectives in MORL for LLMs, improving training efficiency and performance on tool-calling and summarization tasks under GRPO and GDPO.

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

cs.CL · 2026-06-02 · unverdicted · novelty 5.0

QUBRIC co-designs queries and rubrics via teacher key points, contrastive generation, and learnability filtering to support GRPO training, yielding +5.5 on ArenaHard and +6.3 average transfer to legal/moral/narrative benchmarks.

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

RUBAS decomposes agent behavior into four rubric dimensions to supply fine-grained RL rewards that improve safety while preserving task utility on agent benchmarks.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

Reinforcement Learning with Robust Rubric Rewards

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

RLR³ extends RLVR to criterion-level rubric verification via dual execution paths, minimal exposure masking, hierarchical aggregation, and saturation mitigation, delivering 4.7-point gains over base on 15 benchmarks with Qwen3-VL-30B-A3B.

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step

citing papers explorer

Showing 28 of 28 citing papers.

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning cs.LG · 2026-06-03 · unverdicted · none · ref 3
BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents cs.AI · 2026-06-02 · unverdicted · none · ref 9
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment cs.AI · 2026-05-17 · unverdicted · none · ref 11 · 2 links
Rubric-based On-policy Distillation cs.LG · 2026-05-08 · unverdicted · none · ref 35
Visual Preference Optimization with Rubric Rewards cs.CV · 2026-04-14 · unverdicted · none · ref 48
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models cs.CL · 2026-04-08 · unverdicted · none · ref 6
Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy cs.LG · 2026-03-04 · unverdicted · none · ref 8
ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents cs.AI · 2026-06-19 · unverdicted · none · ref 8
Deep Research as Rubric for Reinforcement Learning cs.CL · 2026-05-31 · unverdicted · none · ref 7
Prompt-Level Reward Specifications for Open-Ended Post-Training cs.CL · 2026-05-28 · unverdicted · none · ref 3
Reward Hacking in Rubric-Based Reinforcement Learning cs.AI · 2026-05-12 · unverdicted · none · ref 17
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics cs.CL · 2026-05-10 · unverdicted · none · ref 38
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification cs.CL · 2026-05-10 · unverdicted · none · ref 15
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning cs.AI · 2026-05-08 · unverdicted · none · ref 6
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents cs.LG · 2026-05-07 · unverdicted · none · ref 25
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text cs.CL · 2026-04-21 · unverdicted · none · ref 17
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 39
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences cs.CL · 2026-04-15 · unverdicted · none · ref 5
SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models cs.LG · 2026-06-05 · unverdicted · none · ref 4
QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards cs.CL · 2026-06-02 · unverdicted · none · ref 36
RUBAS: Rubric-Based Reinforcement Learning for Agent Safety cs.LG · 2026-06-02 · unverdicted · none · ref 3
Trust Region On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 184
Reinforcement Learning with Robust Rubric Rewards cs.CV · 2026-05-28 · unverdicted · none · ref 11
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models cs.CV · 2026-05-20 · unverdicted · none · ref 58
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants cs.CL · 2026-05-10 · unverdicted · none · ref 88
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering cs.AI · 2026-05-04 · unverdicted · none · ref 14
SCPRM adds prefix conditioning and schema distance to process reward models so that Monte Carlo Tree Search can explore knowledge-graph reasoning paths with both cumulative and future guidance, yielding a 1.18% average Hits@k gain on medical, legal, and CWQ tasks.
Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care cs.AI · 2026-06-08 · unverdicted · none · ref 7
The paper describes Baichuan-M4, a coordinated medical agent system that reports leading scores across static knowledge, dynamic consultation, long-context memory, retrieval, OCR, and multimodal tasks with a 3.3% hallucination rate.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 215
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
