GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

arxiv: 2510.23868 · v5 · submitted 2025-10-27 · 💻 cs.LG · cs.CL


Pith reviewed 2026-05-18 03:37 UTC · model grok-4.3


keywords: GIFT · GRPO · DPO · implicit fine-tuning · group-relative sampling · RLHF · policy optimization · LLM alignment


The pith

The population minimizers of the GIFT loss coincide exactly with the GRPO/RLHF solution family, with a prompt-dependent KL coefficient set by the ratio of the reward standard deviation to the implicit-reward standard deviation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Group-Relative Implicit Fine-Tuning to combine group sampling, implicit rewards, and MSE loss for on-policy RL of language models. Z-score standardization of the implicit rewards from the DPO-style term cancels the partition function Z(x), removing the need for an external scalar beta in the objective. The resulting population minimizers match the GRPO family of policies exactly, but with an endogenous beta(x) set by the ratio of reward standard deviation to policy standard deviation. This equivalence holds without extra assumptions on the reward model. Experiments on 7B-32B models show faster convergence and reduced overfitting on reasoning tasks plus higher win rates on preference benchmarks.

Core claim

By applying z-score standardization to implicit rewards in a loss that mixes GRPO-style group sampling, DPO-style implicit rewards, and UNA-style MSE, the population minimizers of L_GIFT are exactly the policies pi*_beta(y|x) proportional to pi_ref(y|x) exp((1/beta) r_phi(x,y)), where the effective KL coefficient is the prompt-dependent value beta(x) = sigma_phi(x) / sigma_hat_theta(x).
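The chain from standardization to the stated minimizers can be traced in a few lines. What follows is a reconstruction from the claim above, not the paper's own proof. The symbols $s_\theta$, $\mu$, $C(x)$ and $K(x)$ are notation introduced here for illustration. It assumes the z-scores of both quantities are taken under one shared distribution $\mu$ over responses for a fixed prompt $x$, and that $\sigma_\phi(x) > 0$ and $\hat\sigma_\theta(x) > 0$.

```latex
\begin{align*}
\hat r_\theta(x,y) &= \beta\, s_\theta(x,y) + \beta \log Z(x),
  \qquad s_\theta(x,y) = \log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}, \\
\frac{\hat r_\theta - \mathbb{E}_\mu[\hat r_\theta]}{\operatorname{Std}_\mu[\hat r_\theta]}
  &= \frac{s_\theta - \mathbb{E}_\mu[s_\theta]}{\hat\sigma_\theta(x)}
  \qquad \text{(the shift } \beta \log Z(x) \text{ and the scale } \beta > 0 \text{ drop out)}, \\
\frac{s_\theta - \mathbb{E}_\mu[s_\theta]}{\hat\sigma_\theta(x)}
  = \frac{r_\phi - \mathbb{E}_\mu[r_\phi]}{\sigma_\phi(x)}
  &\iff s_\theta(x,y) = C(x) + K(x)\, r_\phi(x,y),
  \quad K(x) = \frac{\hat\sigma_\theta(x)}{\sigma_\phi(x)}, \\
  &\iff \pi_\theta(y|x) \propto \pi_{\text{ref}}(y|x)\,
  \exp\!\Big(\tfrac{1}{\beta(x)}\, r_\phi(x,y)\Big),
  \quad \beta(x) = \frac{1}{K(x)} = \frac{\sigma_\phi(x)}{\hat\sigma_\theta(x)}.
\end{align*}
```

The third line is the zero-MSE condition, which holds $\mu$-almost everywhere. Because $K(x)$ is a ratio of standard deviations it is positive, so the implicit reward must increase with the explicit reward.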

What carries the argument

Z-score standardization applied to DPO-style implicit rewards, which cancels the partition function Z(x) so that the MSE objective yields closed-form equivalence to the GRPO solution family.

If this is right

GIFT reaches the same parametric policy family as GRPO while replacing the externally tuned scalar beta with a prompt-adaptive beta(x) optimized by matching reward distributions.
On RLVR tasks the method converges faster than GRPO, DAPO and GSPO while overfitting less.
On RLHF tasks it yields higher length-controlled win rates than the compared baselines.
The KL coefficient is removed from the explicit objective and replaced by endogenous variance-driven adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Variance-driven adaptation of the KL term could reduce manual hyperparameter search in other on-policy alignment methods.
The same standardization step might be tested for compatibility with additional implicit-reward objectives beyond DPO.
If the closed-form equivalence generalizes, GIFT could serve as a drop-in replacement for GRPO in settings where prompt-specific regularization is desirable.

Load-bearing premise

Z-score standardization of the implicit rewards from the DPO formulation cancels the intractable partition function Z(x) and lets the MSE objective produce exact closed-form equivalence to GRPO.
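The premise rests on an elementary property of z-scores: they are unchanged by adding a constant or multiplying by a positive scale. A minimal numerical sketch, assuming the implicit reward has the DPO form beta * s + beta * log Z(x); the helper `zscore` and the toy values are illustrative and not taken from the paper.

```python
import numpy as np

def zscore(values):
    """Standardize a 1-D array with its own mean and population std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

rng = np.random.default_rng(0)
s = rng.normal(size=8)   # stand-in for log(pi_theta / pi_ref) on 8 rollouts
log_Z = 3.7              # arbitrary value; intractable in practice

for beta in (0.1, 1.0, 5.0):
    implicit_reward = beta * s + beta * log_Z
    # Neither beta nor log_Z survives standardization.
    assert np.allclose(zscore(implicit_reward), zscore(s))
```

The same invariance is why the standardized objective cannot distinguish between values of beta: only the shape of the log-ratio across responses matters.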

What would settle it

A direct computation or sampling experiment in which the argmin of L_GIFT differs from the GRPO family when reward variance is high or group sampling deviates from the assumed distribution would falsify the equivalence.
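A direct computation of this kind can be sketched on a toy problem with a handful of discrete responses. The setup below is an assumption made for illustration, not the paper's experiment: the function names are hypothetical, the "population" loss is an exact weighted sum over six responses, and the shared measure is taken to be the current policy itself.

```python
import numpy as np

def zscore_under(values, weights):
    """Standardize `values` using the mean and std taken under `weights`."""
    mean = np.sum(weights * values)
    std = np.sqrt(np.sum(weights * (values - mean) ** 2))
    return (values - mean) / std

def population_gift_mse(policy, ref, reward, measure):
    """Toy population loss: MSE between standardized implicit and explicit rewards."""
    s = np.log(policy) - np.log(ref)  # implicit reward up to scale and shift
    diff = zscore_under(s, measure) - zscore_under(reward, measure)
    return float(np.sum(measure * diff ** 2))

def tilted_policy(ref, reward, beta):
    """pi_beta(y) proportional to ref(y) * exp(reward(y) / beta)."""
    logits = np.log(ref) + reward / beta
    logits -= logits.max()            # numerical stability only
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(1)
ref = rng.dirichlet(np.ones(6))
reward = rng.normal(size=6)

# Policies inside the family, each evaluated under its own distribution.
in_family = []
for beta in (0.2, 1.0, 4.0):
    pi = tilted_policy(ref, reward, beta)
    in_family.append(population_gift_mse(pi, ref, reward, pi))

# A generic policy whose log-ratio is not affine in the reward.
other = rng.dirichlet(np.ones(6))
out_of_family = population_gift_mse(other, ref, reward, other)
```

In this toy setting every member of the family attains zero loss up to floating-point error, whatever the value of beta, while the generic policy does not. A falsification attempt would look for a zero-loss policy outside the family, or for a family member with nonzero loss under the sampling scheme actually used.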

Figures

Figures reproduced from arXiv: 2510.23868 by Zhichao Wang.

**Figure 2.** (a) Impact of rollout numbers (N = 1, 2, 4, 8, 16, 32) during fine-tuning; (b) Comparison of implicit reward definitions: summation (kl_sum) vs. averaging (kl_average)

**Figure 3.** Comparison of GIFT and GRPO on DeepSeek-7B using GSM8K and MATH datasets. Training and evaluation

**Figure 4.** Comparison of GIFT and GRPO on Qwen2.5-32B using GSM8K and MATH datasets. GIFT achieves faster

**Figure 5.** Comparison of GIFT and GRPO on Qwen3-32B-base and Qwen2.5-32B-base using DAPO dataset for

**Figure 6.** Comparison of GIFT and GRPO on Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct using INFINITY

Original abstract

This paper investigates whether reward matching is a viable alternative to reward maximization methods for on-policy RL of LLMs. Group-relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, DPO-style implicit reward, and UNA-style MSE between implicit and explicit advantages. By applying z-score standardization, the intractable partition function $Z(x)$ in the DPO implicit reward is canceled, and the KL coefficient $\beta$ is eliminated from the RLHF and RLVR objective. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $\pi^{*}_{\beta}(y|x)\propto\pi_{\text{ref}}(y|x)e^{\frac{1}{\beta}r_{\phi}(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $\beta(x)=\frac{\sigma_\phi(x)}{\hat{\sigma}_\theta(x)}$. GIFT therefore solves the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $\beta$ with a prompt-adaptive $\beta(x)$ optimized endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO and GSPO and overfits less on RLVR (GSM8K, MATH, AIME) and produces higher length-controlled win rates on RLHF (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
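The per-prompt objective described in the abstract can be sketched for a single group of rollouts. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `gift_group_loss` is hypothetical, the implicit reward is taken as the difference of sequence log-probabilities summed over tokens (the Figure 2 caption names both a summation and an averaging variant), and the small `eps` guard against zero variance is an addition made here.

```python
import numpy as np

def gift_group_loss(logp_policy, logp_ref, rewards, eps=1e-8):
    """MSE between z-scored implicit and explicit rewards for one prompt.

    logp_policy, logp_ref: sequence log-probabilities of N rollouts, shape (N,).
    rewards: explicit rewards for the same N rollouts, shape (N,).
    """
    implicit = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    explicit = np.asarray(rewards, dtype=float)
    # Both quantities are standardized on the same group of rollouts.
    z_implicit = (implicit - implicit.mean()) / (implicit.std() + eps)
    z_explicit = (explicit - explicit.mean()) / (explicit.std() + eps)
    return float(np.mean((z_implicit - z_explicit) ** 2))

logp_ref = np.array([-12.0, -15.5, -9.8, -20.1])
rewards = np.array([1.0, 0.0, 1.0, 0.0])

# Log-ratio affine and increasing in the reward: standardized values agree.
loss_aligned = gift_group_loss(logp_ref + 0.5 * rewards, logp_ref, rewards)
# Log-ratio decreasing in the reward: standardized values have opposite sign.
loss_misaligned = gift_group_loss(logp_ref - 0.5 * rewards, logp_ref, rewards)
```

In a training loop the log-probabilities of the policy would carry gradients and the reference values would be constants. A group in which every rollout receives the same reward has zero reward variance, which is the case the `eps` term is meant to keep finite.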

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GIFT derives a prompt-adaptive beta from variance matching after z-score standardization, recovering the same GRPO policy family without a fixed scalar hyperparameter.


The main takeaway is that GIFT uses z-score standardization on DPO-style implicit rewards to cancel the partition function and then matches variances to produce an endogenous, prompt-dependent beta. This keeps the optimal policy exactly in the GRPO/RLHF family while removing the need to tune beta externally. The stress-test note confirms the population math goes through cleanly with no extra assumptions required on the reward model or sampling measure beyond the shared distribution for the z-scores. That is the actual new piece: the closed-form characterization of the minimizers with beta(x) = sigma_phi(x) / hat_sigma_theta(x).

Referee Report

2 major / 3 minor

Summary. The paper proposes Group-Relative Implicit Fine-Tuning (GIFT), which integrates GRPO-style group sampling, DPO-style implicit rewards defined via log-ratios, and UNA-style MSE objectives between standardized implicit and explicit advantages. By applying z-score standardization to both the implicit reward s(y|x) = log(π_θ(y|x)/π_ref(y|x)) and the explicit reward r_φ under the same measure, the intractable partition function Z(x) is canceled, yielding a closed-form characterization of the population minimizers of L_GIFT. These minimizers coincide exactly with the GRPO/RLHF solution family π*_β(y|x) ∝ π_ref(y|x) exp(r_φ(x,y)/β) but with an endogenous, prompt-dependent KL coefficient β(x) = σ_φ(x)/σ̂_θ(x), the ratio of the explicit-reward standard deviation to the implicit-reward standard deviation. Empirically, GIFT is shown to converge faster than GRPO, DAPO and GSPO on 7B-32B models, overfit less on RLVR tasks (GSM8K, MATH, AIME), and achieve higher length-controlled win rates on RLHF benchmarks (AlpacaEval, Arena-Hard). All proofs are deferred to the appendix.

Significance. If the population-level equivalence holds, GIFT supplies a theoretically grounded mechanism for replacing GRPO's externally tuned scalar β with a variance-derived adaptive β(x) that is optimized endogenously through reward-distribution matching. This addresses a practical pain point in RLHF/RLVR by reducing hyperparameter sensitivity while preserving the same optimal policy family. The reported empirical gains in convergence speed, reduced overfitting, and win rates on standard benchmarks indicate potential practical utility for on-policy fine-tuning of LLMs. The explicit deferral of proofs to the appendix and the provision of reproducible experimental details would further strengthen verifiability.

major comments (2)

[§3.2] §3.2 (or equivalent derivation section): the claim that z-score standardization on s(y|x) and r_φ exactly cancels -log Z(x) and produces MSE = 0 at the GRPO optimum requires explicit verification that the standardization is performed under the identical population measure for both quantities; if the sampling distribution for estimating σ̂_θ differs from that used for σ_φ, the cancellation may hold only approximately.
[Theorem 1] Theorem 1 (population minimizer characterization): the derivation that log(π_θ/π_ref) = C + K r_φ with K = σ̂_θ/σ_φ recovers the RLHF optimality condition for β(x) = 1/K = σ_φ/σ̂_θ is load-bearing; the manuscript should include a short self-contained proof sketch in the main text (rather than solely in the appendix) showing that direct substitution of this candidate policy indeed yields MSE = 0 without additional assumptions on the reward model.

minor comments (3)

The notation for the estimated standard deviation σ̂_θ(x) should be defined at first use and distinguished clearly from the population σ_φ(x) to avoid reader confusion in the variance-ratio definition of β(x).
Figure 2 (or equivalent convergence plot): axis labels and legend should explicitly state whether the x-axis is training steps or tokens and whether win rates are length-controlled; current presentation makes direct comparison to GRPO baselines harder.
The abstract states that 'all proofs and detailed background are deferred to the appendix'; a one-sentence pointer in the main text (e.g., 'see Appendix A for the full derivation of the population minimizer') would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, constructive feedback, and positive recommendation. We address each major comment below and will revise the manuscript accordingly to strengthen clarity and verifiability.


Referee: [§3.2] §3.2 (or equivalent derivation section): the claim that z-score standardization on s(y|x) and r_φ exactly cancels -log Z(x) and produces MSE = 0 at the GRPO optimum requires explicit verification that the standardization is performed under the identical population measure for both quantities; if the sampling distribution for estimating σ̂_θ differs from that used for σ_φ, the cancellation may hold only approximately.

Authors: We appreciate this observation on the measure used for standardization. In the GIFT procedure, both the implicit advantage (derived from s(y|x)) and the explicit advantage (from r_φ) are z-score standardized using mean and standard deviation computed from the exact same group of on-policy samples drawn from π_θ for each prompt x. Consequently, the empirical measure is identical by construction for both quantities. At the population level, the standardization is performed with respect to the distribution induced by the current policy π_θ, which is shared. We will insert a clarifying sentence in §3.2 explicitly noting that the z-score normalization employs the identical sampling distribution for implicit and explicit advantages, thereby confirming exact cancellation of -log Z(x) under this shared measure.
Revision: yes
Referee: [Theorem 1] Theorem 1 (population minimizer characterization): the derivation that log(π_θ/π_ref) = C + K r_φ with K = σ̂_θ/σ_φ recovers the RLHF optimality condition for β(x) = 1/K = σ_φ/σ̂_θ is load-bearing; the manuscript should include a short self-contained proof sketch in the main text (rather than solely in the appendix) showing that direct substitution of this candidate policy indeed yields MSE = 0 without additional assumptions on the reward model.

Authors: We agree that a self-contained sketch in the main text would enhance readability. We will add a brief, self-contained proof sketch immediately after the statement of Theorem 1. The sketch will substitute the candidate policy π_θ(y|x) ∝ π_ref(y|x) exp((σ̂_θ(x)/σ_φ(x)) r_φ(x,y)), that is, exp(r_φ(x,y)/β(x)) with β(x) = σ_φ(x)/σ̂_θ(x), directly into the population MSE objective, verify that the standardized implicit and explicit advantages become identical (hence MSE = 0), and confirm that this holds under the definitions of the z-scores without requiring further assumptions on the reward model beyond the shared sampling measure.
Revision: yes
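The shared-measure point in the first exchange can be illustrated numerically. The sketch below is a toy construction, not the paper's procedure: it takes a log-ratio that is exactly affine in the reward (a policy inside the family, with an arbitrary beta of 0.7) and compares standardizing both quantities on the same rollouts against standardizing the reward with statistics from a different batch. All names and values are illustrative.

```python
import numpy as np

def zscore(values):
    """Standardize with the array's own mean and population std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

rng = np.random.default_rng(2)
beta = 0.7
rewards = rng.normal(size=16)
implicit = rewards / beta + 1.9   # s = C + r / beta on these 16 rollouts

# Shared statistics: both quantities standardized on the same rollouts.
shared_mse = float(np.mean((zscore(implicit) - zscore(rewards)) ** 2))

# Mismatched statistics: reward mean and std come from a different batch.
other_rewards = rng.normal(size=16)
z_rewards_mismatched = (rewards - other_rewards.mean()) / other_rewards.std()
mismatched_mse = float(np.mean((zscore(implicit) - z_rewards_mismatched) ** 2))
```

With shared statistics the loss vanishes up to floating-point error. With mismatched statistics a residual remains even though the policy is already in the target family, which is the approximate-cancellation case the referee describes.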

Circularity Check

0 steps flagged

No significant circularity identified


The paper characterizes the population minimizers of L_GIFT in closed form by z-score standardization on DPO-style implicit rewards and explicit advantages, which cancels the partition function and yields exact equivalence to the GRPO/RLHF policy family with endogenous β(x) = σ_φ(x)/σ̂_θ(x). This is a direct mathematical result: the candidate policy achieves MSE=0 under the standardized objective, recovering the variance ratio self-consistently in the population limit with no extra assumptions on the reward model or sampling. The central claim is an equivalence proof rather than a fitted parameter renamed as prediction, a self-definition, or load-bearing self-citation; the derivation remains independent of its inputs and externally verifiable via the stated population distribution.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach relies on standard assumptions from RLHF literature plus the specific standardization trick; no new entities are introduced, but the adaptive beta is derived rather than free.

free parameters (1)

prompt-dependent variance estimates
σ_φ(x) and σ̂_θ(x) are used to define β(x) and are likely estimated from the reward model and policy outputs.
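How these estimates might be formed from one group of rollouts can be sketched as follows. The function `estimate_beta` is a hypothetical helper, not part of the paper. It assumes σ̂_θ(x) is the standard deviation of the log-ratio log(π_θ/π_ref) over the group and σ_φ(x) is the standard deviation of the rewards over the same group.

```python
import numpy as np

def estimate_beta(logp_policy, logp_ref, rewards):
    """Return sigma_phi / sigma_theta for one prompt's group of rollouts."""
    implicit = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    sigma_theta = implicit.std()
    sigma_phi = np.asarray(rewards, dtype=float).std()
    if sigma_theta == 0.0:
        # Policy and reference agree up to a constant on this group.
        return float("inf")
    return float(sigma_phi / sigma_theta)

rewards = np.array([0.2, 1.3, -0.7, 0.9, 0.0])
logp_ref = np.array([-14.0, -11.2, -18.5, -9.9, -13.3])
# A policy inside the family with beta = 2: log-ratio = reward / 2 + constant.
logp_policy = logp_ref + rewards / 2.0 - 0.4
beta_hat = estimate_beta(logp_policy, logp_ref, rewards)
```

For a policy whose log-ratio is exactly affine in the reward, the estimate recovers the underlying beta (here 2) up to floating-point error. The ratio does not depend on whether the population or the sample standard deviation is used, as long as both quantities use the same convention.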

axioms (2)

domain assumption: Z-score standardization cancels the partition function Z(x) in the DPO implicit reward
This is invoked to eliminate the KL coefficient β from the objective.
domain assumption: The MSE between implicit and explicit advantages leads to the stated closed-form minimizers
Central to characterizing the population minimizers of L_GIFT.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "By applying z-score standardization, the intractable partition function Z(x) in the DPO implicit reward is canceled... β(x)=σ_φ(x)/σ̂_θ(x)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.
Passage: "the population minimizers of L_GIFT coincide exactly with the GRPO/RLHF solution family"


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

[1] OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, et al. GPT-4o system card, 2024.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al. Language models are few-shot learners, 2020.
[3] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, et al. Training language models to follow instructions with human feedback, 2022.
[4] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.
[5] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[6] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, et al. Direct preference optimization: Your language model is secretly a reward model, 2024.
[7] Zhichao Wang, Bin Bi, Can Huang, Shiva Kumar Pentyala, Zixu James Zhu, et al. UNA: Unifying alignments of RLHF/PPO, DPO and KTO by a generalized implicit reward function, 2025.
[8] Qwen Team. Qwen2.5: A party of foundation models, September 2024.
[9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[10] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, et al. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
[11] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025.
[12] Maxwell-Jia. AIME 2024 dataset. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024, 2025. Accessed: 2025-01-04.
[13] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report, 2025.
[14] Jijie Li, Li Du, Hanyu Zhao, Bo wen Zhang, Liangdong Wang, et al. Infinity Instruct: Scaling instruction selection and synthesis to enhance language models, 2025.
[15] Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, et al. Skywork-Reward-V2: Scaling preference data curation via human-AI synergy. arXiv preprint arXiv:2507.01352, 2025.
[16] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022.
[17] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, et al. BBQ: A hand-built bias benchmark for question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[19] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, et al. Program synthesis with large language models, 2021.
[20] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, et al. Think you have solved question answering? Try ARC, the AI2 reasoning challenge, 2018.
[21] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution, 2018.
[22] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, et al. GPQA: A graduate-level Google-proof Q&A benchmark, 2023.
[23] Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning, 2024.
[24] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
[25] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The Arena-Hard pipeline. https://lmsys.org/blog/2024-04-19-arena-hard/, April 2024.
[26] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
[28] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, et al. Group sequence policy optimization, 2025.
