pith. machine review for the scientific record

arxiv: 2507.17746 · v2 · submitted 2025-07-23 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning · rubrics as rewards · LLM post-training · reward design · medical reasoning · science QA · evaluation benchmarks · LLM-as-judge

The pith

Using instance-specific rubrics as rewards enables effective reinforcement learning on nuanced reasoning tasks where binary correctness does not apply.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Rubrics as Rewards (RaR) to extend reinforcement learning with verifiable rewards into domains like medicine and science, where success requires multi-criteria judgments rather than clear right-or-wrong answers. Instead of relying on direct Likert-scale scores from an LLM judge, RaR aggregates detailed rubric feedback into scalar rewards for on-policy training. Experiments show the strongest RaR variant delivers relative gains of up to 31 percent on HealthBench and 7 percent on GPQA-Diamond compared with standard LLM-as-judge baselines. The approach also produces more stable results across judge sizes and works well whether evaluation uses rubrics or multiple-choice formats. This opens a route to apply RL-style post-training to real-world reasoning problems that lack simple verification signals.
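To make the mechanism concrete, here is a minimal sketch of turning an instance-specific rubric into a scalar reward. This is an editorial illustration, not the paper's implementation: the Criterion structure, the weights, and the toy judge check are all assumed.

    from dataclasses import dataclass

    @dataclass
    class Criterion:
        description: str  # e.g. "mentions contraindications of the prescribed drug"
        weight: float     # instance-specific importance of this criterion

    def judge_satisfies(response: str, criterion: Criterion) -> bool:
        """Stand-in for an LLM judge checking one rubric criterion.
        In practice this would prompt a judge model with the response and
        the criterion text, then parse a yes/no verdict."""
        return criterion.description.lower() in response.lower()  # toy check only

    def rubric_reward(response: str, rubric: list[Criterion]) -> float:
        """Aggregate per-criterion verdicts into one scalar reward in [0, 1]."""
        total = sum(c.weight for c in rubric)
        earned = sum(c.weight for c in rubric if judge_satisfies(response, c))
        return earned / total if total > 0 else 0.0

The scalar returned by rubric_reward is what the on-policy optimizer would consume in place of a binary verification signal.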

Core claim

Rubrics as Rewards (RaR) is an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by converting instance-specific rubric feedback into reward signals. Multiple aggregation strategies were tested; the best variant yields relative improvements of up to 31% on HealthBench and 7% on GPQA-Diamond over popular LLM-as-judge baselines that use direct Likert-based rewards. RaR-trained policies adapt to both rubric-based and multiple-choice evaluations, deliver stronger alignment when smaller judges are used, and exhibit lower performance variance across judge scales.

What carries the argument

Instance-specific rubrics aggregated into scalar rewards that serve as the training signal for on-policy policy optimization.
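One plausible form of that aggregation, shown for concreteness (the paper evaluates several strategies, and its exact operator may differ): with binary judge verdicts c_k(y) on the K rubric criteria for response y and per-criterion weights w_k,

    \[
      r(y \mid x) \;=\; \frac{\sum_{k=1}^{K} w_k \, c_k(y)}{\sum_{k=1}^{K} w_k} \;\in\; [0, 1],
      \qquad c_k(y) \in \{0, 1\}.
    \]

A simple unweighted mean is the special case w_k = 1 for all k, which is exactly the ablation the referee report below asks for.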

If this is right

  • RaR policies maintain strong results on both rubric-scored and multiple-choice evaluations after training.
  • Rubric-based rewards produce lower performance variance when models are judged by LLMs of different sizes.
  • Smaller judges achieve better alignment with human preferences when rubrics rather than raw Likert scores are used as the reward source.
  • The method extends RL-style post-training to medical and scientific reasoning without requiring binary correctness signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Rubric-style rewards could be tested in additional non-verifiable areas such as legal analysis or ethical reasoning where multi-criteria judgment is required.
  • Automating rubric creation might allow the approach to scale to new tasks without manual rubric design for each instance.
  • The reduced variance with smaller judges suggests RaR could lower the compute cost of reward modeling in production pipelines.

Load-bearing premise

Rubric feedback can be reliably turned into scalar rewards that give a less biased and more stable training signal than direct LLM Likert judgments, and this holds across different domains and judge sizes.

What would settle it

A controlled experiment on a new non-verifiable domain where every RaR aggregation strategy produces average performance no higher than direct Likert baselines and shows equal or greater variance when smaller judges are used.
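A skeleton of that experiment, under stated assumptions: score_policy is a hypothetical helper that evaluates a trained policy under a judge of a given size, and the judge sizes are illustrative.

    import statistics

    def score_policy(policy_name: str, judge_size: str) -> float:
        """Hypothetical: benchmark score of `policy_name` as graded by a
        judge of `judge_size` on the held-out non-verifiable domain."""
        raise NotImplementedError  # wire up to the actual evaluation harness

    def compare(judge_sizes: list[str]) -> None:
        for policy in ("rar_best_variant", "direct_likert_baseline"):
            scores = [score_policy(policy, size) for size in judge_sizes]
            print(policy,
                  "mean:", round(statistics.mean(scores), 3),
                  "variance across judges:", round(statistics.variance(scores), 5))

    # e.g. compare(["7B", "32B", "70B"]): the premise fails if RaR's mean is
    # no higher and its variance no lower than the Likert baseline's.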

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce $\textbf{Rubrics as Rewards}$ (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to $31\%$ on HealthBench and $7\%$ on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Rubrics as Rewards (RaR), an on-policy RL method that derives scalar rewards from instance-specific rubrics to extend RLVR to nuanced, non-verifiable reasoning tasks in medical and scientific domains. It evaluates several rubric aggregation strategies and reports that the best variant yields relative gains of up to 31% on HealthBench and 7% on GPQA-Diamond versus direct Likert-based LLM-as-judge baselines, with additional benefits for smaller judges and reduced variance across judge scales.

Significance. If the central claims hold, the work offers a concrete mechanism for supplying structured, multi-criteria reward signals in open-ended domains where binary verification is unavailable. The reported improvements and the finding that smaller models benefit disproportionately could inform practical post-training pipelines for alignment on complex reasoning benchmarks.

major comments (3)
  1. [§4.2, Table 2] The performance claims rest on a particular rubric-to-scalar aggregation operator, yet no ablation replaces this operator with a simple mean or an alternative judge family; without such controls it remains unclear whether the 31% and 7% lifts are attributable to the rubric structure itself.
  2. [§5.1] No inter-judge agreement statistics (e.g., Pearson correlation or Cohen’s kappa) are supplied comparing the aggregated rubric scalar to the direct Likert scalar, leaving the claim of reduced bias and greater stability unsubstantiated (a sketch of such statistics follows the minor comments below).
  3. [Table 2] The reported relative improvements lack accompanying details on the number of independent runs, statistical significance tests, or standard-error estimates, making it impossible to judge whether the gains are reliable.
minor comments (2)
  1. [§3.1] The reward-function definition would be clearer if an explicit equation showed the precise mapping from per-criterion rubric scores to the final scalar reward.
  2. [Figure 3] Error bars or confidence intervals should be added to the variance-reduction plot to allow visual assessment of the claimed stability improvement.
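On major comment 2, agreement statistics of the kind requested could be computed as follows. A hedged sketch: the paired scores are illustrative placeholders, not values from the paper, and the 1-5 binning of the rubric scalar is one assumed discretization.

    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    rubric_scalar = [0.9, 0.4, 0.7, 0.2, 0.8]  # aggregated rubric rewards per response
    likert_scalar = [5, 2, 4, 1, 4]            # direct Likert scores, same responses

    r, p = pearsonr(rubric_scalar, likert_scalar)
    print(f"Pearson r = {r:.3f} (p = {p:.3f})")

    # Cohen's kappa needs categorical labels, so bin the rubric scalar
    # onto the same 1-5 scale before comparing.
    rubric_binned = [min(5, int(x * 5) + 1) for x in rubric_scalar]
    print(f"Cohen's kappa = {cohen_kappa_score(rubric_binned, likert_scalar):.3f}")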

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful comments and suggestions. We address each major comment point-by-point below and plan to incorporate the revisions in the next version of the manuscript.

read point-by-point responses
  1. Referee: [§4.2, Table 2] The performance claims rest on a particular rubric-to-scalar aggregation operator, yet no ablation replaces this operator with a simple mean or an alternative judge family; without such controls it remains unclear whether the 31% and 7% lifts are attributable to the rubric structure itself.

    Authors: We thank the referee for this observation. While §4.2 presents results for multiple rubric aggregation strategies, we acknowledge that a direct comparison to a simple mean aggregation and to alternative judge families was not included. In the revised manuscript, we will add these ablations to better isolate the contribution of the structured rubric approach. revision: yes

  2. Referee: [§5.1] No inter-judge agreement statistics (e.g., Pearson correlation or Cohen’s kappa) are supplied comparing the aggregated rubric scalar to the direct Likert scalar, leaving the claim of reduced bias and greater stability unsubstantiated.

    Authors: This is a valid point. We will include inter-judge agreement statistics, such as Pearson correlations and Cohen’s kappa, between the aggregated rubric scalars and direct Likert scores in the revised §5.1 to provide quantitative support for the claims regarding reduced bias and improved stability. revision: yes

  3. Referee: [Table 2] The reported relative improvements lack accompanying details on the number of independent runs, statistical significance tests, or standard-error estimates, making it impossible to judge whether the gains are reliable.

    Authors: We agree that reporting these details is essential for assessing reliability. We will add the number of independent runs, standard error estimates, and results of statistical significance tests to Table 2 and the associated text in the revision. revision: yes
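For the reliability details promised in response 3, a paired bootstrap is one standard way to attach uncertainty to a relative improvement. A minimal sketch with illustrative placeholder scores, not the paper's numbers:

    import random

    rar_scores  = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.57]
    base_scores = [0.55, 0.51, 0.60, 0.58, 0.54, 0.59, 0.56, 0.50]

    def rel_improvement(a: list[float], b: list[float]) -> float:
        return (sum(a) / len(a)) / (sum(b) / len(b)) - 1.0

    n, boot = len(rar_scores), []
    for _ in range(10_000):
        idx = [random.randrange(n) for _ in range(n)]  # resample runs in pairs
        boot.append(rel_improvement([rar_scores[i] for i in idx],
                                    [base_scores[i] for i in idx]))
    boot.sort()
    lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
    print(f"relative improvement, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")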

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmark comparisons, not self-referential definitions or fitted inputs.

full rationale

The paper introduces RaR as an on-policy RL method using rubric aggregation strategies and reports measured improvements (up to 31% on HealthBench, 7% on GPQA-Diamond) against LLM-as-judge baselines. No equations, derivations, or self-citations are presented that reduce the central claim to a fitted parameter renamed as prediction or to a uniqueness theorem imported from prior author work. The aggregation strategies are evaluated experimentally rather than defined to force the outcome by construction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method assumes rubric scores can be turned into effective scalar rewards without introducing new biases; this is tested empirically but not derived from first principles.

axioms (1)
  • domain assumption: Rubric feedback aggregated into rewards yields a more reliable training signal than direct Likert scores from LLMs.
    Core premise of RaR; invoked to justify the approach over baselines.

pith-pipeline@v0.9.0 · 5524 in / 1133 out tokens · 43077 ms · 2026-05-13T05:58:11.714367+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

    cs.LG 2026-05 unverdicted novelty 7.0

    RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...

  3. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  4. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  5. Visual Preference Optimization with Rubric Rewards

    cs.CV 2026-04 unverdicted novelty 7.0

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  6. Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

  7. Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

    cs.LG 2026-03 unverdicted novelty 7.0

    A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

  8. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  9. Reward Hacking in Rubric-Based Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

  10. SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

    cs.CL 2026-05 unverdicted novelty 6.0

    SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.

  11. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

    cs.LG 2026-05 unverdicted novelty 6.0

    RubricRefine raises average tool-use reliability to 0.86 on M3ToolEval across seven models by scoring candidate code against generated contract rubrics before execution, beating prior inference-time methods at 2.6X lo...

  12. CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

    cs.CL 2026-05 unverdicted novelty 6.0

    CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.

  13. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  14. Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.

  15. SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage poin...

  16. RVPO: Risk-Sensitive Alignment via Variance Regularization

    cs.LG 2026-05 unverdicted novelty 6.0

    RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.

  17. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  18. Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

    cs.CL 2026-04 unverdicted novelty 6.0

    POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...

  19. C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

    cs.CL 2026-04 unverdicted novelty 6.0

    C2 synthesizes contrastive helpful/misleading rubric pairs from binary preferences to train cooperative generators and critical verifiers, yielding up to 6.5-point gains on RM-Bench and enabling smaller models to matc...

  20. ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework

    cs.AI 2026-04 unverdicted novelty 6.0

    ReflectRM improves generative reward models by adding self-reflection on analysis quality within a unified training setup for response and analysis preferences, yielding accuracy gains and reduced positional bias on b...

  21. Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

    cs.LG 2026-04 unverdicted novelty 6.0

    Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

  22. MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

    cs.CL 2026-03 unverdicted novelty 6.0

    MoRI improves LLM scientific ideation by training models via SFT to generate motivations followed by composite RL rewards for entropy-aware information gain and contrastive semantic alignment, leading to higher novelt...

  23. Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

    cs.CL 2026-05 unverdicted novelty 5.0

    Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.

  24. SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

    cs.AI 2026-05 unverdicted novelty 5.0

    SCPRM adds prefix conditioning and schema distance to process reward models so that Monte Carlo Tree Search can explore knowledge-graph reasoning paths with both cumulative and future guidance, yielding a 1.18% averag...

  25. LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.

  26. PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

    cs.LG 2026-04 unverdicted novelty 5.0

    PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

  27. SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

    cs.AI 2026-04 unverdicted novelty 4.0

    SPARD dynamically tunes multi-objective reward weights and data importance in LLM reinforcement learning alignment using a self-paced curriculum driven by reward dynamics and data utility.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 26 Pith papers · 8 internal anchors
