pith. machine review for the scientific record.

arxiv: 2601.05242 · v1 · submitted 2026-01-08 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Pith reviewed 2026-05-15 05:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords multi-reward RL · policy optimization · advantage normalization · GRPO · GDPO · language model alignment · reinforcement learning · tool calling

The pith

Decoupling normalization of each reward in multi-reward RL prevents collapse of advantage values into identical signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Direct application of group relative policy optimization across multiple distinct rewards causes the computed advantages from different reward combinations to become identical, which flattens the training signal and produces suboptimal convergence or outright training failure. GDPO fixes this by normalizing each reward separately before any aggregation step, so that relative differences among the rewards remain visible to the policy update. The method is tested on tool calling, math reasoning, and coding tasks, where it improves both accuracy and format adherence metrics while avoiding the instabilities seen with the prior approach.

Core claim

Directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. GDPO resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability.
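
For orientation, the two normalization orders can be written out as below. This page gives no formulas, so the notation is an assumption of this note: G rollouts per group, K rewards per rollout, r_i^{(k)} the k-th reward of rollout i, and mean/std taken over the group. The coupled form follows the usual group-relative advantage; the decoupled form is one plausible reading of "normalize each reward before aggregation" and leaves out any reward weights or further normalization the paper may apply.

```latex
% Coupled (GRPO-style): sum the rewards, then normalize within the group
r_i = \sum_{k=1}^{K} r_i^{(k)}, \qquad
A_i^{\text{coupled}} =
  \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
       {\operatorname{std}(r_1, \dots, r_G)}

% Decoupled (GDPO-style): normalize each reward within the group, then sum
A_i^{\text{decoupled}} =
  \sum_{k=1}^{K}
  \frac{r_i^{(k)} - \operatorname{mean}\bigl(r_1^{(k)}, \dots, r_G^{(k)}\bigr)}
       {\operatorname{std}\bigl(r_1^{(k)}, \dots, r_G^{(k)}\bigr)}
```

Under the coupled form, any two groups whose summed rewards differ only by a positive scale and shift receive the same advantages, which is one way the collapse described above can arise.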

What carries the argument

Group reward-Decoupled Normalization, which normalizes each individual reward signal on its own before aggregation to keep distinct advantage scales intact.
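
A minimal sketch of the mechanism, assuming z-score normalization within a rollout group and a plain unweighted sum of rewards. The function names, the zero-variance handling, and the toy reward values are all hypothetical; this is not the authors' implementation.

```python
from statistics import mean, pstdev


def zscore(values):
    """Zero-mean, unit-std normalization; all zeros if the group has no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def coupled_advantages(rewards):
    """GRPO-style (assumed form): sum rewards per rollout, then normalize the sums."""
    totals = [sum(r) for r in rewards]
    return zscore(totals)


def decoupled_advantages(rewards):
    """GDPO-style (assumed form): normalize each reward column, then sum per rollout."""
    columns = list(zip(*rewards))               # one tuple per reward
    normalized = [zscore(col) for col in columns]
    return [sum(vals) for vals in zip(*normalized)]


# Two toy groups of two rollouts, each scored by two binary rewards.
group_a = [(0, 0), (1, 0)]  # second rollout satisfies one reward
group_b = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

for name, group in (("A", group_a), ("B", group_b)):
    print(name, coupled_advantages(group), decoupled_advantages(group))
```

In this toy case the coupled advantages are the same for both groups, while the decoupled advantages are larger in magnitude for the group where the second rollout satisfies both rewards.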

If this is right

  • GDPO improves the correctness metrics (accuracy, bug ratio) over GRPO on the tool calling, math reasoning, and coding reasoning tasks.
  • GDPO improves constraint metrics such as format adherence and response length control in the same settings.
  • Training runs with GDPO avoid the early collapse and instability observed with direct GRPO application.
  • The decoupled normalization approach generalizes across different multi-reward combinations used for language model alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling step could be inserted into other policy-gradient methods that aggregate multiple reward signals.
  • Increasing the number of simultaneous rewards may amplify the benefit of independent normalization.
  • The technique offers a practical route for scaling alignment objectives without sacrificing signal resolution.

Load-bearing premise

Separately normalizing each reward before aggregation will preserve their relative differences without creating new scaling problems or instabilities.

What would settle it

Run both GRPO and GDPO on identical multi-reward rollouts and check whether the resulting advantage vectors remain distinct under GDPO while collapsing under GRPO; training divergence or early failure only under GRPO would confirm the claim.
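
A small-scale version of this check can be sketched by enumerating every possible group of binary rewards and counting how many distinct advantage vectors each normalization order produces. The z-score form, the unweighted sum, and the name `count_distinct` are assumptions for illustration, not the paper's protocol.

```python
from itertools import product
from statistics import mean, pstdev


def zscore(values):
    """Zero-mean, unit-std normalization; all zeros if the group has no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def count_distinct(group_size, num_rewards):
    """Count distinct advantage vectors over all groups of binary rewards."""
    single = list(product((0, 1), repeat=num_rewards))  # rewards of one rollout
    coupled, decoupled = set(), set()
    for group in product(single, repeat=group_size):
        # Coupled: normalize the per-rollout reward sums.
        totals = [sum(r) for r in group]
        coupled.add(tuple(round(a, 6) for a in zscore(totals)))
        # Decoupled: normalize each reward column, then sum per rollout.
        per_reward = [zscore(col) for col in zip(*group)]
        summed = [sum(vals) for vals in zip(*per_reward)]
        decoupled.add(tuple(round(a, 6) for a in summed))
    return len(coupled), len(decoupled)


n_coupled, n_decoupled = count_distinct(group_size=2, num_rewards=2)
print(n_coupled, n_decoupled)
```

For two rollouts and two binary rewards, the enumeration yields 3 distinct coupled vectors and 5 distinct decoupled ones. Larger `group_size` or `num_rewards` values can be tried the same way, though the enumeration grows exponentially.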

Original abstract

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that directly applying GRPO to multi-reward RL settings causes distinct rollout reward combinations to collapse into identical advantage values, degrading the training signal; it introduces GDPO to decouple per-reward normalization before aggregation, preserving relative differences and improving stability, with consistent gains over GRPO on tool-calling, math-reasoning, and coding tasks measured by correctness and constraint metrics.

Significance. If the collapse mechanism and the benefit of decoupled normalization hold under rigorous verification, GDPO would offer a practical improvement for multi-objective RLHF pipelines in language models, where diverse preference rewards are increasingly common; the empirical consistency across three tasks is a modest strength, though the absence of derivations or controlled ablations limits the assessed impact.

major comments (2)
  1. [Abstract] Abstract and method description: the collapse of distinct rewards into identical advantages under GRPO is asserted without any equations, derivation, or explicit advantage formula showing how normalization produces this outcome; this makes the central motivation load-bearing yet unverifiable from the provided text.
  2. [Experiments] Experiments section: no scale-variation ablations or variance-heterogeneity tests are reported, despite the concern that per-reward z-scores can still distort the composite advantage when reward dynamic ranges differ; without these, the claim that GDPO faithfully preserves differences remains untested.
minor comments (2)
  1. Define acronyms (GRPO, GDPO) on first use and ensure consistent notation for advantage and normalization terms throughout.
  2. Add error bars, run counts, or statistical tests to the reported outperformance metrics to strengthen the empirical claims.
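
The scale-variation probe requested in major comment 2 could start from something like the sketch below. It rescales one reward by a constant and compares how the advantages move under each normalization order. The z-score form, the helper names, and the toy (correctness, format) values are hypothetical, and nothing here reproduces the paper's experiments.

```python
from statistics import mean, pstdev


def zscore(values):
    """Zero-mean, unit-std normalization; all zeros if the group has no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def coupled(rewards):
    """Assumed GRPO-style order: sum rewards, then normalize."""
    return zscore([sum(r) for r in rewards])


def decoupled(rewards):
    """Assumed GDPO-style order: normalize each reward, then sum."""
    per_reward = [zscore(col) for col in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]


def rescale(rewards, index, factor):
    """Multiply one reward column by a constant factor."""
    return [
        tuple(v * factor if k == index else v for k, v in enumerate(r))
        for r in rewards
    ]


# Hypothetical group of four rollouts scored as (correctness, format).
base = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
scaled = rescale(base, index=0, factor=100.0)

print("coupled  ", coupled(base), coupled(scaled))
print("decoupled", decoupled(base), decoupled(scaled))
```

In this toy group the decoupled advantages are unchanged by the rescaling, because a z-score is invariant to a positive rescaling of its input, while the coupled advantages shift toward the rescaled reward. Whether giving every reward equal spread in this way is desirable or a distortion is the question the comment raises, and the sketch does not settle it.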

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments have helped us identify areas where the presentation can be improved for clarity and rigor. Below, we provide point-by-point responses to the major comments, outlining the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the collapse of distinct rewards into identical advantages under GRPO is asserted without any equations, derivation, or explicit advantage formula showing how normalization produces this outcome; this makes the central motivation load-bearing yet unverifiable from the provided text.

    Authors: We thank the referee for pointing this out. To address this, we will revise the method description to include the explicit GRPO advantage formula and a derivation demonstrating the collapse of distinct multi-reward combinations into identical advantages. This will make the motivation verifiable and strengthen the paper's foundation. revision: yes

  2. Referee: [Experiments] Experiments section: no scale-variation ablations or variance-heterogeneity tests are reported, despite the concern that per-reward z-scores can still distort the composite advantage when reward dynamic ranges differ; without these, the claim that GDPO faithfully preserves differences remains untested.

    Authors: We acknowledge the need for more rigorous testing of GDPO under varying reward scales. In the revised manuscript, we will include additional ablation experiments that vary the dynamic ranges and variances of the individual rewards to verify that GDPO preserves relative differences better than GRPO. These will be presented in the experiments section or appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: GDPO is an independent algorithmic adjustment with no reduction to fitted inputs or self-citations

Full rationale

The paper identifies an empirical failure mode of GRPO under multi-reward rollouts (collapse of advantage values) and proposes GDPO as a direct decoupling of per-reward normalization. No equations, derivations, or parameter-fitting steps are shown that would make the claimed improvement equivalent to a quantity defined by the method itself. The central claim rests on the algorithmic description rather than any self-citation chain, uniqueness theorem, or ansatz imported from prior work by the same authors. The derivation chain is therefore self-contained as a proposed fix to an observed limitation, with no load-bearing step that reduces by construction to the inputs or outputs of GDPO.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the core addition is the decoupling step whose implementation details and any implicit scaling choices are not stated.

pith-pipeline@v0.9.0 · 5582 in / 1057 out tokens · 48092 ms · 2026-05-15T05:26:51.583532+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...

  3. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  4. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  5. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

  6. BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...

  7. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  8. RVPO: Risk-Sensitive Alignment via Variance Regularization

    cs.LG 2026-05 unverdicted novelty 6.0

    RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.

  9. AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

    cs.CV 2026-04 unverdicted novelty 6.0

    AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.

  10. Target Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.

  11. Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

    cs.AI 2026-05 unverdicted novelty 5.0

    MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

  12. MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

  13. Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

    cs.AI 2026-05 unverdicted novelty 5.0

    Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.

  14. Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis

    cs.CR 2026-05 unverdicted novelty 5.0

    Pen-Strategist fine-tunes Qwen-3-14B with RL on a pentesting reasoning dataset and pairs it with a CNN step classifier, reporting 87% better strategy derivation, 47.5% more subtask completions than baselines, and gain...

  15. LASER: Learning Active Sensing for Continuum Field Reconstruction

    cs.LG 2026-04 unverdicted novelty 5.0

    LASER trains a reinforcement learning policy inside a latent dynamics model to choose sensor placements that improve reconstruction of continuum fields under sparsity.

  16. Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

    cs.SD 2026-04 unverdicted novelty 5.0

    A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.

  17. TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning

    eess.SP 2026-04 unverdicted novelty 5.0

    TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.

  18. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  19. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
