arXiv: 2601.05242 · v1 · submitted 2026-01-08 · 💻 cs.CL · cs.AI · cs.LG

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

Pith reviewed 2026-05-15 05:26 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.LG

keywords: multi-reward RL, policy optimization, advantage normalization, GRPO, GDPO, language model alignment, reinforcement learning, tool calling

The pith

Decoupling normalization of each reward in multi-reward RL prevents collapse of advantage values into identical signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Direct application of group relative policy optimization across multiple distinct rewards causes the computed advantages from different reward combinations to become identical, which flattens the training signal and produces suboptimal convergence or outright training failure. GDPO fixes this by normalizing each reward separately before any aggregation step, so that relative differences among the rewards remain visible to the policy update. The method is tested on tool calling, math reasoning, and coding tasks, where it improves both accuracy and format adherence metrics while avoiding the instabilities seen with the prior approach.

Core claim

Directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. GDPO resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability.

What carries the argument

Group reward-Decoupled Normalization, which normalizes each individual reward signal on its own before aggregation to keep distinct advantage scales intact.
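
To make the mechanism concrete, here is a minimal Python sketch on toy data. It assumes the coupled baseline sums each rollout's rewards and z-scores the totals within a group, and that the decoupled variant z-scores each reward within the group before summing. The function names, the `eps` guard, and the two example groups are illustrative choices rather than details taken from the paper, and any further normalization the authors may apply after aggregation is left out.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Standardize a list of numbers within one rollout group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def grpo_advantages(rewards):
    """Coupled (assumed GRPO-style): sum rewards per rollout, then normalize."""
    totals = [sum(r) for r in rewards]
    return zscore(totals)


def gdpo_advantages(rewards):
    """Decoupled (assumed GDPO-style): normalize each reward, then sum."""
    columns = list(zip(*rewards))  # one tuple per reward signal
    normalized = [zscore(list(col)) for col in columns]
    return [sum(per_reward) for per_reward in zip(*normalized)]


# Hypothetical groups of two rollouts, each scored by two binary rewards.
group_a = [(0, 0), (1, 0)]  # second rollout earns one reward
group_b = [(0, 0), (1, 1)]  # second rollout earns both rewards

for name, group in (("group_a", group_a), ("group_b", group_b)):
    coupled = [round(a, 3) for a in grpo_advantages(group)]
    decoupled = [round(a, 3) for a in gdpo_advantages(group)]
    print(name, "coupled:", coupled, "decoupled:", decoupled)
```

In this toy case both groups receive the same coupled advantages even though the second rollout earned one reward in `group_a` and two in `group_b`; the decoupled version keeps that difference.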

If this is right

GDPO produces higher accuracy and lower bug ratios than GRPO on tool calling, math reasoning, and coding reasoning tasks.
GDPO improves constraint metrics such as format adherence and response length control in the same settings.
Training runs with GDPO avoid the early collapse and instability observed with direct GRPO application.
The decoupled normalization approach generalizes across different multi-reward combinations used for language model alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling step could be inserted into other policy-gradient methods that aggregate multiple reward signals.
Increasing the number of simultaneous rewards may amplify the benefit of independent normalization.
The technique offers a practical route for scaling alignment objectives without sacrificing signal resolution.

Load-bearing premise

Separately normalizing each reward before aggregation will preserve their relative differences without creating new scaling problems or instabilities.

What would settle it

Run both GRPO and GDPO on identical multi-reward rollouts and check whether the resulting advantage vectors remain distinct under GDPO while collapsing under GRPO; training divergence or early failure only under GRPO would confirm the claim.
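
A small arithmetic version of this check can be run before any training. The sketch below enumerates every assignment of binary rewards to a tiny group and counts how many distinct advantage vectors each scheme can produce. It assumes the coupled scheme z-scores summed rewards and the decoupled scheme z-scores each reward before summing; `coupled`, `decoupled`, `count_distinct`, and the group sizes are hypothetical names and settings, not the paper's.

```python
from itertools import product
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Standardize values within one group; eps guards zero variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def coupled(rewards):
    # Assumed GRPO-style: sum rewards per rollout, then normalize the totals.
    return zscore([sum(r) for r in rewards])


def decoupled(rewards):
    # Assumed GDPO-style: normalize each reward column, then sum per rollout.
    per_reward = [zscore(list(col)) for col in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]


def count_distinct(advantage_fn, group_size=2, num_rewards=2):
    """Count distinct advantage vectors over every binary reward assignment."""
    outcomes = list(product([0, 1], repeat=num_rewards))  # one rollout's rewards
    seen = set()
    for group in product(outcomes, repeat=group_size):
        seen.add(tuple(round(a, 3) for a in advantage_fn(group)))
    return len(seen)


coupled_count = count_distinct(coupled)
decoupled_count = count_distinct(decoupled)
print("coupled:", coupled_count, "decoupled:", decoupled_count)
```

A larger count means more reward combinations remain distinguishable to the policy update. This only probes the arithmetic of the two schemes; whether the difference translates into training stability still requires the full experiment described above.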

Original abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GDPO decouples per-reward normalization to avoid advantage collapse in multi-reward GRPO, with consistent gains on tool, math, and coding tasks, though scale-variance handling remains untested.

The main thing to know is that GDPO stops the advantage values from collapsing to the same number when GRPO is applied to several rewards at once. By normalizing each reward separately before aggregation, the method keeps more of the original signal differences and leads to steadier training plus better results on the three tasks they ran. The authors show this on tool calling, math reasoning, and coding, tracking correctness (accuracy, bug ratio) alongside constraint adherence (format, length), with GDPO ahead of plain GRPO in every case. That practical improvement is the core contribution. The idea is simple enough that teams already using multi-reward RL for alignment might try it directly if they see early failures or flat signals.

What the paper does well is name a failure mode that shows up once you stack distinct rewards and then supply a targeted change that appears to restore resolution without extra machinery. The experiments cover realistic settings and report both performance and constraint metrics, which makes the comparison more useful than pure accuracy numbers.

The soft spots sit in the mechanics of the aggregation step. If the individual rewards have mismatched variances or ranges, separate normalization can still let the highest-variance term dominate the final advantage, and the work does not include re-weighting or controlled scale-variation ablations to check this. The stress-test concern holds up on the evidence given: the fix might relocate rather than remove the scaling problem. Without the exact update equations or those extra checks, it is hard to judge how robust the gains are outside the reported setups.

This paper is aimed at engineers running multi-reward RL pipelines on language models who have hit stability issues. Readers who need a quick, testable adjustment will get value from it. It deserves a serious referee because the identified problem is common in current alignment work and the proposed change is concrete enough to evaluate properly, even if reviewers will likely ask for more analysis on reward-scale sensitivity.

Referee Report

2 major / 2 minor

Summary. The paper claims that directly applying GRPO to multi-reward RL settings causes distinct rollout reward combinations to collapse into identical advantage values, degrading the training signal; it introduces GDPO to decouple per-reward normalization before aggregation, preserving relative differences and improving stability, with consistent gains over GRPO on tool-calling, math-reasoning, and coding tasks measured by correctness and constraint metrics.

Significance. If the collapse mechanism and the benefit of decoupled normalization hold under rigorous verification, GDPO would offer a practical improvement for multi-objective RLHF pipelines in language models, where diverse preference rewards are increasingly common; the empirical consistency across three tasks is a modest strength, though the absence of derivations or controlled ablations limits the assessed impact.

major comments (2)

[Abstract] Abstract and method description: the collapse of distinct rewards into identical advantages under GRPO is asserted without any equations, derivation, or explicit advantage formula showing how normalization produces this outcome; this makes the central motivation load-bearing yet unverifiable from the provided text.
[Experiments] Experiments section: no scale-variation ablations or variance-heterogeneity tests are reported despite the skeptic concern that per-reward z-scores can still distort the composite advantage when reward dynamic ranges differ; without these, the claim that GDPO faithfully preserves differences remains untested.
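
One way the scale-variation ablation requested here could begin is with a purely arithmetic check. The sketch below rescales one reward in a hypothetical group and measures how far the advantages move under each scheme. It assumes the coupled scheme z-scores summed rewards and the decoupled scheme z-scores each reward before summing; `rescale`, `max_shift`, and the reward values are made up for illustration and do not come from the paper.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-6):
    """Standardize values within one group; eps guards zero variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def coupled(rewards):
    # Assumed GRPO-style: sum rewards per rollout, then normalize the totals.
    return zscore([sum(r) for r in rewards])


def decoupled(rewards):
    # Assumed GDPO-style: normalize each reward column, then sum per rollout.
    per_reward = [zscore(list(col)) for col in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]


def rescale(rewards, index, factor):
    """Multiply one reward column by a positive constant."""
    return [
        tuple(v * factor if k == index else v for k, v in enumerate(r))
        for r in rewards
    ]


def max_shift(advantage_fn, rewards, index, factor):
    """Largest change in any rollout's advantage after rescaling one reward."""
    before = advantage_fn(rewards)
    after = advantage_fn(rescale(rewards, index, factor))
    return max(abs(a - b) for a, b in zip(before, after))


# Hypothetical group of four rollouts scored as (correctness, format).
group = [(1.0, 0.2), (0.0, 0.9), (1.0, 0.8), (0.0, 0.1)]

coupled_shift = max_shift(coupled, group, index=1, factor=10.0)
decoupled_shift = max_shift(decoupled, group, index=1, factor=10.0)
print("coupled shift:", round(coupled_shift, 3))
print("decoupled shift:", round(decoupled_shift, 6))
```

Per-reward z-scoring is invariant to positive rescaling up to the `eps` guard, so the decoupled shift stays near zero while the coupled one does not. The same property stretches a reward with very small within-group spread to unit variance, which is the kind of distortion a training-level ablation would still need to probe.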

minor comments (2)

Define acronyms (GRPO, GDPO) on first use and ensure consistent notation for advantage and normalization terms throughout.
Add error bars, run counts, or statistical tests to the reported outperformance metrics to strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments have helped us identify areas where the presentation can be improved for clarity and rigor. Below, we provide point-by-point responses to the major comments, outlining the revisions we plan to make.

Referee: [Abstract] Abstract and method description: the collapse of distinct rewards into identical advantages under GRPO is asserted without any equations, derivation, or explicit advantage formula showing how normalization produces this outcome; this makes the central motivation load-bearing yet unverifiable from the provided text.

Authors: We thank the referee for pointing this out. To address this, we will revise the method description to include the explicit GRPO advantage formula and a derivation demonstrating the collapse of distinct multi-reward combinations into identical advantages. This will make the motivation verifiable and strengthen the paper's foundation.

Revision: yes

Referee: [Experiments] Experiments section: no scale-variation ablations or variance-heterogeneity tests are reported despite the skeptic concern that per-reward z-scores can still distort the composite advantage when reward dynamic ranges differ; without these, the claim that GDPO faithfully preserves differences remains untested.

Authors: We acknowledge the need for more rigorous testing of GDPO under varying reward scales. In the revised manuscript, we will include additional ablation experiments that vary the dynamic ranges and variances of the individual rewards to verify that GDPO preserves relative differences better than GRPO. These will be presented in the experiments section or appendix.

Revision: yes

Circularity Check

0 steps flagged

No circularity: GDPO is an independent algorithmic adjustment with no reduction to fitted inputs or self-citations

The paper identifies an empirical failure mode of GRPO under multi-reward rollouts (collapse of advantage values) and proposes GDPO as a direct decoupling of per-reward normalization. No equations, derivations, or parameter-fitting steps are shown that would make the claimed improvement equivalent to a quantity defined by the method itself. The central claim rests on the algorithmic description rather than any self-citation chain, uniqueness theorem, or ansatz imported from prior work by the same authors. The derivation chain is therefore self-contained as a proposed fix to an observed limitation, with no load-bearing step that reduces by construction to the inputs or outputs of GDPO.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the core addition is the decoupling step whose implementation details and any implicit scaling choices are not stated.

pith-pipeline@v0.9.0 · 5582 in / 1057 out tokens · 48092 ms · 2026-05-15T05:26:51.583532+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
cs.LG 2026-05 unverdicted novelty 7.0

RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
cs.CV 2026-05 unverdicted novelty 6.0

BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

RVPO: Risk-Sensitive Alignment via Variance Regularization
cs.LG 2026-05 unverdicted novelty 6.0

RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
cs.CV 2026-04 unverdicted novelty 6.0

AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.

Target Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0

TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
cs.AI 2026-05 unverdicted novelty 5.0

MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
cs.CL 2026-05 unverdicted novelty 5.0

MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
cs.AI 2026-05 unverdicted novelty 5.0

Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis
cs.CR 2026-05 unverdicted novelty 5.0

Pen-Strategist fine-tunes Qwen-3-14B with RL on a pentesting reasoning dataset and pairs it with a CNN step classifier, reporting 87% better strategy derivation, 47.5% more subtask completions than baselines, and gain...

LASER: Learning Active Sensing for Continuum Field Reconstruction
cs.LG 2026-04 unverdicted novelty 5.0

LASER trains a reinforcement learning policy inside a latent dynamics model to choose sensor placements that improve reconstruction of continuum fields under sparsity.

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
cs.SD 2026-04 unverdicted novelty 5.0

A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.

TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
eess.SP 2026-04 unverdicted novelty 5.0

TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
