ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

Fei Wang; Inderjit S. Dhillon; Nirmal Patel

arxiv: 2605.12667 · v2 · pith:4BVJROPAnew · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-19 14:29 UTC · model grok-4.3

keywords: reinforcement learning from AI feedback · policy optimization · noisy discrete rewards · ordinal decomposition · LLM alignment · advantage estimation · robust optimization · stochastic evaluation

The pith

Decomposing discrete rewards into ordinal binary indicators stabilizes policy optimization against stochastic auto-rater noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Aligning large language models in non-verifiable domains relies on reinforcement learning from AI feedback (RLAIF), where auto-raters assign discrete scores, for example on a 1-10 rubric, that fluctuate from one call to the next. In standard estimators, a single noisy high or low score can distort the normalization statistics and weaken the learning signal. ODRPO converts each reward into a chain of binary success checks at rising difficulty levels, calculates an advantage separately for each check, and sums the results. This structure keeps extreme outliers from dominating the update and creates a built-in progression from easier to harder thresholds. The reported result is more reliable training on Qwen2.5-7B and Qwen3-4B, with relative gains of up to 14.8 percent on FACTS-grounding-v2 and 7.5 percent on Alpaca-Evals and no added compute per step.

Core claim

ODRPO decomposes discrete rewards into a sequence of ordinal binary indicators. Advantages are computed and accumulated independently across these progressively challenging success thresholds. This structurally isolates evaluation noise and prevents outlier evaluations from corrupting the global update while establishing an implicit variance-aware learning curriculum. The method delivers relative improvements of up to 14.8 percent on FACTS-grounding-v2 and 7.5 percent on Alpaca-Evals for Qwen2.5-7B and Qwen3-4B models, requires no additional compute per step, and is supported by theoretical analysis confirming optimization stability.

What carries the argument

Ordinal decomposition of discrete rewards into binary success thresholds, with independent advantage computation and accumulation at each threshold.
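
As a rough illustration of the decomposition described above, the sketch below computes summed per-threshold advantages for one group of sampled responses. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `odrpo_advantages`, the integer thresholds 1 to 10, and in particular the use of the group standard deviation as the per-threshold normalizer are assumptions, since this page does not state how each threshold's advantage is scaled.

```python
import numpy as np

def odrpo_advantages(rewards, thresholds, eps=1e-8):
    """Sum of per-threshold advantages for one prompt's group of responses.

    Assumption: each threshold's advantage is mean-centered and divided by
    the group standard deviation of that threshold's binary indicator.
    """
    rewards = np.asarray(rewards, dtype=float)
    total = np.zeros_like(rewards)
    for t in thresholds:
        # Binary success indicator: did the response reach threshold t?
        indicator = (rewards >= t).astype(float)
        spread = indicator.std()
        if spread < eps:
            # Every response agrees at this threshold, so it carries no signal.
            continue
        total += (indicator - indicator.mean()) / spread
    return total

# Hypothetical group of four responses scored on a 1-10 rubric.
group_rewards = [3, 3, 3, 10]
advantages = odrpo_advantages(group_rewards, thresholds=range(1, 11))
print(advantages)
```

Thresholds at which every response in the group agrees contribute nothing, so only thresholds that actually separate the group shape the summed advantage. How strongly a single extreme score moves the result depends on the normalizer, which is the part this sketch assumes.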

If this is right

Outlier reward samples no longer dominate the global learning signal.
Training remains stable without the cost of repeated reward sampling and majority voting.
An implicit curriculum emerges from easier to harder success thresholds.
Optimization stability holds according to the provided theoretical analysis.
The approach applies directly to any RLAIF setting that uses multi-tier discrete rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could extend to other reinforcement learning domains that use ranked or ordinal feedback rather than continuous rewards.
Fewer repeated evaluations per prompt may become viable because noise is handled structurally instead of through averaging.
The threshold progression might be tuned to match task difficulty distributions for faster convergence on specific benchmarks.
Variance reduction in advantage estimates could be quantified directly to test the isolation claim on new auto-rater setups.

Load-bearing premise

Independently accumulating advantages across ordinal binary success thresholds will isolate stochastic evaluation noise without losing the overall learning signal from the original reward.

What would settle it

Measure whether the variance of advantage estimates or the frequency of policy updates driven by single extreme reward samples drops under ODRPO relative to GRPO or MaxRL on the same noisy reward traces.
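
One way to set up the measurement proposed above is a small harness that re-scores the same group many times under a noise model and records the per-sample variance of the resulting advantages. This is a sketch only: `advantage_variance`, `rare_outlier_noise`, and the noise model itself (each score independently replaced by a uniform draw with probability `flip_prob`) are hypothetical, and `grpo_advantages` is a simplified mean/std baseline rather than a full GRPO implementation. Any estimator with the same call shape can be passed in for comparison.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Simplified baseline: mean-centered rewards divided by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def rare_outlier_noise(clean, rng, flip_prob=0.1, low=1, high=10):
    """Hypothetical noise model: replace each score by a uniform integer draw."""
    flips = rng.random(clean.shape) < flip_prob
    random_scores = rng.integers(low, high + 1, size=clean.shape)
    return np.where(flips, random_scores, clean)

def advantage_variance(estimator, clean_rewards, noise_sampler, trials=1000, seed=0):
    """Per-sample variance of advantages across repeated noisy re-scorings."""
    rng = np.random.default_rng(seed)
    clean = np.asarray(clean_rewards, dtype=float)
    samples = np.stack(
        [estimator(noise_sampler(clean, rng)) for _ in range(trials)]
    )
    return samples.var(axis=0)

# Hypothetical clean scores for one group of four responses.
variances = advantage_variance(grpo_advantages, [6, 7, 7, 8], rare_outlier_noise)
print(variances)
```

Running the same call with a second estimator on identical seeds gives a like-for-like comparison; the harness itself makes no claim about which estimator comes out ahead.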

Figures

Figures reproduced from arXiv: 2605.12667 by Fei Wang, Inderjit S. Dhillon, Nirmal Patel.

**Figure 2.** Visualization of Gini and Gini-Med weighting behaviors across four representative reward …

**Figure 3.** Alpaca-Evals values and time per step in seconds for majority voting ensemble analysis.

**Figure 4.** Statistical profiles for 1,000 datapoints from the Ultrafeedback dataset …

**Figure 5.** Training reward curves for GRPO and MaxRL using Qwen2.5-7B-Instruct as the policy …

**Figure 6.** Comparative analysis of final training rewards for MaxRL and ODRPO variants across …

**Figure 7.** GRPO and MaxRL Mean Absolute Curl (MAC) value for varying …

Original abstract

The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify the stochasticity of auto-raters that can propagate and corrupt standard advantage estimators like GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking majority voting may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce $\textbf{O}$rdinal $\textbf{D}$ecomposition for $\textbf{R}$obust $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{ODRPO}$), a framework that structurally isolates evaluation noise by decomposing discrete rewards into a sequence of ordinal binary indicators. By independently computing and accumulating advantages across these progressively challenging success thresholds, ODRPO prevents outlier evaluations from corrupting the global update while establishing an implicit, variance-aware learning curriculum. Empirically, ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators. Supported by theoretical analysis confirming its optimization stability, ODRPO provides a scalable and robust framework for aligning models within the noisy, discrete evaluation landscape of modern RLAIF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Ordinal Decomposition for Robust Policy Optimization (ODRPO) for RLAIF with stochastic discrete rewards from LLM auto-raters. Discrete rewards are decomposed into ordinal binary indicators I_k = 1{r >= t_k} at ordered thresholds t_k; advantages A_k are computed independently for each threshold and summed for the policy gradient update. This is claimed to isolate evaluation noise, prevent outlier samples from corrupting the global signal, and induce an implicit variance-aware curriculum. Empirical results on Qwen2.5-7B and Qwen3-4B report relative gains of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals versus GRPO and MaxRL baselines, with negligible extra compute per step. Theoretical analysis is supplied to establish optimization stability.

Significance. If the summed-advantage construction remains unbiased and lower-variance under the joint noise that actually arises from a single auto-rater call, ODRPO would supply a lightweight, theoretically grounded alternative to majority-voting or repeated sampling for noisy discrete rewards. The zero-overhead claim and the explicit stability analysis are strengths that would make the method immediately usable in large-scale LLM alignment pipelines.

major comments (2)

[Method section] Method section (description of ODRPO): the claim that independently accumulating advantages across ordinal thresholds 'structurally isolates' stochastic evaluation noise is not accompanied by a derivation showing that the summed advantage remains unbiased or lower-variance when all I_k are generated from the identical prompt and sampling run. The correlation term that appears in Var(sum A_k) under shared auto-rater stochasticity is neither bounded nor shown to vanish, which directly undermines the central robustness argument.
[Theoretical analysis] Theoretical analysis: the stability proof assumes noise independence across thresholds. Because the indicators are deterministically linked through the same auto-rater output, this assumption is violated; the manuscript must either relax the independence premise or supply a corrected variance bound that accounts for the joint distribution of the ordinal vector.
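
The correlation structure at issue in these two comments can be made concrete for the raw indicators. The following is a worked identity, not the paper's derivation: the symbol $p_k = \Pr(r \ge t_k)$ is introduced here for illustration, and the covariance result covers the indicators $I_k$ themselves rather than the group-normalized advantages $A_k$.

```latex
\operatorname{Var}\Big(\sum_{k} A_k\Big)
  = \sum_{k} \operatorname{Var}(A_k)
  + 2 \sum_{j<k} \operatorname{Cov}(A_j, A_k)

% Nested indicators: for t_j < t_k, I_k = 1 implies I_j = 1, so I_j I_k = I_k and
\operatorname{Cov}(I_j, I_k)
  = \mathbb{E}[I_k] - p_j\, p_k
  = p_k\,(1 - p_j) \;\ge\; 0
```

Each cross term is non-negative and vanishes only when $p_k = 0$ or $p_j = 1$, so the indicators are never independent except in those degenerate cases, and a variance bound has to handle these terms explicitly.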

minor comments (2)

[Empirical Evaluation] Empirical section: the reported relative improvements lack error bars, number of random seeds, or statistical significance tests; these details are needed to assess whether the gains are robust.
[Method section] Notation: the thresholds t_k and the resulting binary indicators I_k should be defined with an explicit equation at first use rather than only in prose.
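
An explicit definition of the kind requested in the notation comment could read as follows. This is a sketch of one possible statement, assuming an integer rubric $r \in \{1, \dots, K\}$ with thresholds $t_k = k$ for the second line; the function notation $I_k(r)$ is introduced here and is not taken from the manuscript.

```latex
t_1 < t_2 < \dots < t_K, \qquad
I_k(r) = \mathbb{1}\{\, r \ge t_k \,\}, \quad k = 1, \dots, K

% Integer rubric r \in \{1, \dots, K\} with t_k = k:
r = \sum_{k=1}^{K} I_k(r), \qquad
I_1(r) \ge I_2(r) \ge \dots \ge I_K(r)
```

In the integer case the indicators sum back to the original score, so the decomposition loses no information about the reward itself.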

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments have prompted us to strengthen the theoretical grounding of ODRPO. We respond to each major comment below and indicate the revisions made.

Point-by-point responses

Referee: [Method section] Method section (description of ODRPO): the claim that independently accumulating advantages across ordinal thresholds 'structurally isolates' stochastic evaluation noise is not accompanied by a derivation showing that the summed advantage remains unbiased or lower-variance when all I_k are generated from the identical prompt and sampling run. The correlation term that appears in Var(sum A_k) under shared auto-rater stochasticity is neither bounded nor shown to vanish, which directly undermines the central robustness argument.

Authors: We agree that the original manuscript did not supply an explicit derivation of the summed advantage under the joint distribution induced by a single auto-rater call. In the revised version we add a new derivation in Section 3.2 that explicitly computes Var(sum_k A_k) and bounds the covariance terms. Because the ordinal indicators satisfy I_k >= I_{k+1} almost surely, the positive correlations are controlled by the monotonicity of the threshold sequence; the resulting bound shows that the total variance remains strictly smaller than that of a monolithic advantage estimator for any finite number of thresholds. This establishes the robustness claim without requiring noise independence.
Revision: yes

Referee: [Theoretical analysis] Theoretical analysis: the stability proof assumes noise independence across thresholds. Because the indicators are deterministically linked through the same auto-rater output, this assumption is violated; the manuscript must either relax the independence premise or supply a corrected variance bound that accounts for the joint distribution of the ordinal vector.

Authors: The referee correctly identifies that the original stability argument invoked an independence assumption across thresholds that does not hold exactly. We have revised the theoretical analysis (Section 4) to remove this assumption. The updated proof derives a variance bound directly on the joint distribution of the ordinal vector by exploiting the deterministic ordering I_1 >= ... >= I_K. The corrected bound confirms that the policy-gradient update remains stable and that the implicit curriculum effect persists under the realistic noise model arising from a single auto-rater evaluation.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in ODRPO derivation chain

The paper defines ODRPO via ordinal decomposition of discrete rewards r into binary indicators I_k = 1{r >= t_k} for ordered thresholds, then independently accumulates advantages A_k before policy update. This construction is presented as a structural change to the estimator rather than a quantity fitted from or defined in terms of the target performance metrics. No equations reduce by construction to the inputs (e.g., no fitted parameter renamed as prediction, no self-citation load-bearing the uniqueness or stability claim). The claimed theoretical analysis of optimization stability is external to the decomposition itself and does not tautologically restate the noise-isolation assumption. Empirical gains on FACTS-grounding-v2 and Alpaca-Evals are reported as independent validation, keeping the framework self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that auto-rater stochasticity can be isolated by threshold decomposition without introducing new biases or requiring additional sampling.

axioms (1)

Domain assumption: LLM-based auto-raters produce inherently stochastic discrete rewards due to prompt sensitivity and sampling randomness. Invoked in the abstract as the core motivation and verified empirically by the authors.

pith-pipeline@v0.9.0 · 5879 in / 1106 out tokens · 43331 ms · 2026-05-19T14:29:39.076691+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We decompose the single scalar reward into multiple sub-rewards representing ordinal success levels, compute the advantage for each level independently, and then accumulate these values... $r^{(k)}_i = \mathbb{I}\{r_i \ge k\}$, $A^{(k)}_i = (r^{(k)}_i - \mu^{(k)}) / N^{(k)}$, $A_i = \sum_k A^{(k)}_i$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

ODRPO admits an objective function J(θ) with gradient... $J(\theta) = \mathbb{E}\left[ \frac{2}{\pi} \sum_m \Delta_m \arcsin\left(\sqrt{P_m}\right) \right]$
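
A short calculation shows what the arcsine form in the quoted objective implies for the gradient. This is a worked derivative under assumptions that this page does not confirm: each $P_m$ is taken to be a differentiable function of $\theta$ with $0 < P_m < 1$, the weights $\Delta_m$ are taken not to depend on $\theta$, and the expectation is taken over a distribution that does not depend on $\theta$.

```latex
\frac{d}{dP}\left[\frac{2}{\pi}\arcsin\sqrt{P}\right]
  = \frac{2}{\pi}\cdot\frac{1}{\sqrt{1-P}}\cdot\frac{1}{2\sqrt{P}}
  = \frac{1}{\pi\sqrt{P(1-P)}}

\nabla_\theta J(\theta)
  = \mathbb{E}\left[\sum_m \frac{\Delta_m}{\pi\sqrt{P_m(1-P_m)}}\,\nabla_\theta P_m\right]
```

The factor $1/\sqrt{P(1-P)}$ is the reciprocal of the standard deviation of a Bernoulli($P$) variable, so under these assumptions each term's gradient is scaled by the inverse spread of a binary outcome with success probability $P_m$.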

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

[1] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, and others. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature. doi:10.1038/s41586-025-09422-z
[2] Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles. 2025.
[3] James Burgess, Jan N. Hansen, Duo Peng, Yuhui Zhang, Alejandro Lozano, Min Woo Sun, Emma Lundberg, Serena Yeung-Levy. PaperSearchQA: Learning to Search and Reason over Scientific Papers with … arXiv:2601.18207.
[4] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs. 2025.
[5] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training. 2026.
[6] Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu. One Token to Fool LLM-as-a-Judge. arXiv preprint arXiv:2507.08794, 2025.
[7] Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, Soroush Vosoughi. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. arXiv preprint arXiv:2406.07791, 2025.
[8] Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu. Evaluating Scoring Bias in LLM-as-a-Judge. arXiv preprint arXiv:2506.22316, 2025.
[9] Horace He, Thinking Machines Lab. Thinking Machines Lab: Connectionism.
[10] Jacky Kwok, Shulu Li, Pranav Atreya, Yuejiang Liu, Marco Pavone, Ion Stoica, Azalia Mirhoseini.
[11] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi.
[12] Proximal Policy Optimization Algorithms. 2017.
[13] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.
[14] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization. 2026.
[15] What is the objective of reasoning with reinforcement learning? 2025.
[16] Maximum Likelihood Reinforcement Learning. 2026.
[17] Qwen2 Technical Report. arXiv preprint arXiv:2407.10671.
[18] Qwen3 Technical Report. 2025.
[19] The Llama 3 Herd of Models. 2024.
[20] UltraFeedback: Boosting Language Models with High-quality Feedback. 2023.
[21] HybridFlow: A Flexible and Efficient RLHF Framework. 2024.
[22] Instruction-Following Evaluation for Large Language Models. 2023.
[23] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, and others. doi:10.5281/zenodo.12608602
[24] RewardBench 2: Advancing Reward Model Evaluation. 2025.
[25] Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy. arXiv preprint arXiv:2507.01352.
[26] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality. 2025.
[27] Gemini 2.5 Flash Model Card. 2025.
[28] Gemini 3 Flash: Frontier intelligence built for speed. 2025.
[29] Gemini 3.1 Flash-Lite Model Card. 2026.
[30] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. GitHub repository, 2023.
[31] Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey. 2020.
[32] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains. The Fourteenth International Conference on Learning Representations.
[33] Reinforcement Learning with Rubric Anchors. 2025.
[34] Stepwise Credit Assignment for GRPO on Flow-Matching Models. 2026.
[35] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2024.
[36] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs. 2025.
[37] M. G. Kendall, B. Babington Smith. The Problem of m Rankings.
[38] Raphael Vallat. Pingouin: statistics in Python. Journal of Open Source Software. doi:10.21105/joss.01026
[39] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[40] Hummer: Towards Limited Competitive Preference Dataset. 2025.
[41] Less is More: Improving LLM Alignment via Preference Data Selection. 2026.
