Mixed citations

Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning. arXiv preprint arXiv:2508.16949

· 2025 · arXiv 2508.16949

Mixed citation behavior. Most common role is background (60%).

14 Pith papers citing it

Background 60% of classified citations

citation-role summary

background 5

citation-polarity summary

background 3 support 1 unclear 1

representative citing papers

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

cs.LG · 2026-03-04 · unverdicted · novelty 7.0

ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

Deep Research as Rubric for Reinforcement Learning

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

DR-rubric is a two-stage framework using iterative agentic search to generate atomic verifiable constraints for GRPO-based RL, achieving competitive performance on 6 benchmarks with 1K-3K examples via bootstrap or frontier-model rubrics.

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

RUBRIC-ARROW is an alternating rubric generator and judge framework that uses probability-based scoring and pairwise preferences to improve pointwise reward modeling accuracy for LLM post-training in non-verifiable domains.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instruction following.

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

Reinforcement Learning with Robust Rubric Rewards

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

RLR³ extends RLVR to criterion-level rubric verification via dual execution paths, minimal exposure masking, hierarchical aggregation, and saturation mitigation, delivering 4.7-point gains over base on 15 benchmarks with Qwen3-VL-30B-A3B.

ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

cs.CL · 2026-05-22 · unverdicted · novelty 5.0 · 2 refs

ARES generates 100K rubric-annotated QA instances from raw documents and demonstrates superior rubric-based RL performance over baselines on open-ended benchmarks.

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

cs.CL · 2026-04-09 · accept · novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

citing papers explorer

Showing 14 of 14 citing papers.

Visual Preference Optimization with Rubric Rewards · cs.CV · 2026-04-14 · unverdicted · none · ref 39
Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy · cs.LG · 2026-03-04 · unverdicted · none · ref 22
Deep Research as Rubric for Reinforcement Learning · cs.CL · 2026-05-31 · unverdicted · none · ref 3
RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains · cs.LG · 2026-05-27 · unverdicted · none · ref 33
Reward Hacking in Rubric-Based Reinforcement Learning · cs.AI · 2026-05-12 · unverdicted · none · ref 37
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning · cs.AI · 2026-05-08 · unverdicted · none · ref 15
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text · cs.CL · 2026-04-21 · unverdicted · none · ref 49
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution · cs.CL · 2026-04-03 · unverdicted · none · ref 40
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks · cs.CL · 2026-04-03 · unverdicted · none · ref 35
Trust Region On-Policy Distillation · cs.LG · 2026-05-31 · unverdicted · none · ref 182
Reinforcement Learning with Robust Rubric Rewards · cs.CV · 2026-05-28 · unverdicted · none · ref 20
ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning · cs.CL · 2026-05-22 · unverdicted · none · ref 16 · 2 links
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models · cs.AI · 2026-05-08 · unverdicted · none · ref 65
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning · cs.CL · 2026-04-09 · accept · none · ref 92
