pith. machine review for the scientific record.

arxiv: 2511.00066 · v4 · pith:SPDDBVRNnew · submitted 2025-10-29 · 💻 cs.LG

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Pith reviewed 2026-05-18 03:07 UTC · model grok-4.3

classification 💻 cs.LG
keywords GRPO, RLVR, generalization, sharpness, token weighting, policy optimization, LLM reasoning, reinforcement learning

The pith

GRPO-SG downweights tokens likely to produce large gradients to stabilize RLVR training and improve generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines Group Relative Policy Optimization (GRPO) in the setting of reinforcement learning with verifiable rewards for large language models. It adopts a robustness perspective in which generalization loss is upper-bounded by a combination of empirical loss and a sharpness term measured by gradient norm. Building on this view, the authors introduce GRPO-SG, a token-weighted variant that downweights tokens expected to trigger overly large gradients. The weighting shapes the policy update to avoid sharp changes, producing smoother gradient trajectories and higher performance on mathematical reasoning, logic puzzles, and tool-augmented question answering. The method is offered as a lightweight modification that can be layered on existing GRPO implementations.

Core claim

GRPO-SG augments the standard GRPO objective with token-specific weights derived from gradient information, thereby reducing the sharpness surrogate and tightening the generalization bound in the RLVR regime.
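
As a rough illustration of what token-specific weights on the GRPO objective could look like, the sketch below applies a per-token weight to the standard clipped surrogate term. The function names, the clip range `eps`, and the length-normalized weighted sum are assumptions made for illustration; the paper's exact objective and normalization are not reproduced on this page.

```python
import math


def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO/GRPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def weighted_surrogate(logp_new, logp_old, advantage, weights, eps=0.2):
    """Token-weighted surrogate for one sampled response.

    `advantage` is the group-relative advantage shared by every token of
    the response. `weights` are hypothetical per-token weights; setting
    them all to 1.0 recovers the unweighted per-response average.
    """
    if not (len(logp_new) == len(logp_old) == len(weights)):
        raise ValueError("all per-token lists must have the same length")
    if not weights:
        raise ValueError("the response must contain at least one token")
    terms = [
        clipped_term(new, old, advantage, eps)
        for new, old in zip(logp_new, logp_old)
    ]
    return sum(w * t for w, t in zip(weights, terms)) / len(terms)
```

With all weights equal to 1.0 this reduces to the ordinary averaged clipped surrogate, which makes it easy to check that a weighting scheme changes only the relative contribution of tokens.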

What carries the argument

Token-weighting scheme that downweights high-gradient tokens via probability shaping to control update sharpness.

Load-bearing premise

That downweighting tokens with large gradients will reliably shrink the sharpness surrogate and thereby improve the generalization bound in RLVR for language models.

What would settle it

An experiment that measures whether GRPO-SG actually lowers gradient-norm peaks while producing higher accuracy on held-out reasoning problems than standard GRPO.
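
One way to make "gradient-norm peaks" measurable is sketched below, assuming the per-step gradient norms of each run have been logged as a list of floats. The spike rule (median plus `k` times the median absolute deviation) and the name `spike_summary` are illustrative choices, not taken from the paper.

```python
import statistics


def spike_summary(grad_norms, k=3.0):
    """Summarize peaks in one gradient-norm trajectory.

    A step counts as a spike when its norm exceeds median + k * MAD,
    where MAD is the median absolute deviation of the trajectory.
    Both the rule and the default k are assumptions for illustration.
    """
    if not grad_norms:
        raise ValueError("grad_norms must be non-empty")
    med = statistics.median(grad_norms)
    mad = statistics.median([abs(g - med) for g in grad_norms])
    threshold = med + k * mad
    spikes = sum(1 for g in grad_norms if g > threshold)
    return {"peak": max(grad_norms), "median": med, "spikes": spikes}
```

Running this on the GRPO and GRPO-SG trajectories and reporting the summaries next to held-out accuracy would give the comparison described above in a single table.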

Figures

Figures reproduced from arXiv: 2511.00066 by Linh Ngo Van, Trung Le, Tue Le.

Figure 1
Figure 1: Gradient norm trajectories during training under GRPO vs. GRPO-SG across three RLVR settings. GRPO-SG consistently exhibits lower variability and fewer spikes than GRPO, consistent with reduced sharpness in Eq. (15) and the bound in Eq. (16).
Figure 2
Figure 2: Training reward trajectories during training under GRPO vs. GRPO-SG across three RLVR settings. GRPO-SG achieves higher reward while also exhibiting lower sharpness as reflected by gradient norms. GRPO-SG yields higher and more stable reward.
Figure 3
Figure 3: Word clouds of the top 100 high- vs. low-probability tokens selected from frequently occurring words. High-probability tokens (left) primarily consist of mathematical and logical operators, brackets, and variable names, where even small errors can invalidate an entire solution, whereas low-probability tokens (right) mostly consist of generic content words that are less critical.
Figure 4
Figure 4: Accuracy on the K&K Logic Puzzles benchmark, broken down by puzzle size (3–7 people). GRPO-SG consistently achieves higher accuracy than GRPO across all difficulty levels, while the Reverse variant yields performance comparable to GRPO without clear improvement.
Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Sharpness-Guided Group Relative Policy Optimization (GRPO-SG), a token-weighted variant of Group Relative Policy Optimization (GRPO) for reinforcement learning with verifiable rewards (RLVR). It motivates the approach via a robustness-based generalization perspective in which generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, then introduces probability shaping to downweight tokens that produce large gradients. Experiments on mathematical reasoning, logic puzzles, and tool-augmented question answering report consistent gains over GRPO together with smoother gradient-norm trajectories.

Significance. If the claimed generalization bound and the empirical improvements hold, GRPO-SG supplies a lightweight, practical upgrade to a widely used RLVR optimizer that directly targets sharpness to stabilize training and reduce the generalization gap. The method’s simplicity and the reported gradient-norm smoothing are clear strengths, and the method could be adopted with minimal implementation cost.

major comments (2)
  1. [Introduction] Introduction / generalization view: The central motivation states that generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, yet no derivation or adaptation of this bound is supplied for the GRPO objective (clipped surrogate with group-relative advantages and non-differentiable verifiable rewards). Standard sharpness bounds assume Lipschitz or smoothness conditions that do not automatically transfer to this setting; the missing link is load-bearing for the token-downweighting rationale.
  2. [Method] Method section (probability shaping): The token-weighting rule is described as downweighting high-gradient tokens, but the precise functional form, normalization, and whether the resulting estimator remains unbiased with respect to the original GRPO advantage estimates are not shown. This detail is required to confirm that the modification does not alter the core policy-gradient properties.
minor comments (2)
  1. Notation for the weighting factor and the gradient-norm surrogate should be introduced with an explicit equation rather than by prose description alone.
  2. [Experiments] Gradient-norm trajectory plots would be clearer if they included shaded standard-deviation bands across runs.
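
The bands requested in the second minor comment could be computed as in the sketch below, assuming gradient norms are logged per step for several seeds. The function name and the input layout (one equally long list per run) are assumptions for illustration.

```python
import statistics


def mean_std_band(runs):
    """Per-step (mean - std, mean, mean + std) across runs.

    `runs` is a list of equally long gradient-norm trajectories,
    one per random seed. The sample standard deviation is used.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    if len({len(r) for r in runs}) != 1:
        raise ValueError("all runs must have the same number of steps")
    band = []
    for step_values in zip(*runs):
        mu = statistics.mean(step_values)
        sd = statistics.stdev(step_values)
        band.append((mu - sd, mu, mu + sd))
    return band
```

The lower and upper values of each tuple can then be passed to a plotting routine such as Matplotlib's `fill_between` to draw the shaded region around the mean curve.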

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications where possible and indicating the revisions we will make to strengthen the paper.

  1. Referee: [Introduction] Introduction / generalization view: The central motivation states that generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, yet no derivation or adaptation of this bound is supplied for the GRPO objective (clipped surrogate with group-relative advantages and non-differentiable verifiable rewards). Standard sharpness bounds assume Lipschitz or smoothness conditions that do not automatically transfer to this setting; the missing link is load-bearing for the token-downweighting rationale.

    Authors: We acknowledge that the manuscript motivates the approach via a robustness-based generalization perspective but does not supply an explicit derivation or adaptation of the bound tailored to the GRPO objective, including its clipped surrogate, group-relative advantages, and non-differentiable verifiable rewards. The bound is presented as a guiding view drawn from the literature on sharpness and generalization rather than a new theorem. We agree this link could be made more precise. In the revision we will add a short paragraph in the introduction sketching the adaptation: the gradient norm is computed on the differentiable policy component (log-probabilities), while the verifiable rewards enter only through the advantage estimates; the token downweighting is intended to reduce the sharpness surrogate term heuristically. We will also note the limitations of standard Lipschitz assumptions in this setting and frame the motivation accordingly. revision: yes

  2. Referee: [Method] Method section (probability shaping): The token-weighting rule is described as downweighting high-gradient tokens, but the precise functional form, normalization, and whether the resulting estimator remains unbiased with respect to the original GRPO advantage estimates are not shown. This detail is required to confirm that the modification does not alter the core policy-gradient properties.

    Authors: We thank the referee for highlighting this omission. The current manuscript describes the idea at a high level but does not provide the exact functional form or normalization. In the revised version we will insert the precise definition: within each group the token weight is w_i = softmax_i(-β · ||∇_θ log π_θ(o_i | q)||), where β controls the downweighting strength and the softmax is taken over the tokens in the group, so the weights sum to one per group. The weights are then multiplied into the per-token contributions of the GRPO loss. Because the weights are computed from the current policy’s gradients (independent of the sampled advantages) and the underlying sampling distribution is unchanged, the estimator remains unbiased for the corresponding weighted policy gradient. We will include the formula, a brief unbiasedness argument, and a short discussion of how the modification preserves the core properties of the original GRPO estimator. revision: yes
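
The weighting rule stated in the simulated rebuttal can be sketched as follows. This is only an illustration of that stated formula, assuming the per-token gradient norms have already been computed and are passed in as plain floats; the function name is hypothetical, and the paper's actual rule may differ.

```python
import math


def softmax_downweights(grad_norms, beta=1.0):
    """Weights w_i = softmax(-beta * g_i) over the tokens of one group.

    `grad_norms` holds one precomputed gradient norm g_i per token.
    Larger norms receive smaller weights; beta = 0 gives uniform weights.
    """
    if not grad_norms:
        raise ValueError("grad_norms must be non-empty")
    scores = [-beta * g for g in grad_norms]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because a softmax already sums to one, this sketch applies no separate normalization step. Note that uniform weights here equal 1/n rather than 1, so the scale of the loss changes with group size unless the weights are rescaled.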

Circularity Check

0 steps flagged

No circularity: generalization view motivates proposal without reducing to inputs by construction

The paper states a robustness-based generalization perspective as motivation, with the bound presented as an upper bound on generalization loss via empirical loss plus gradient-norm sharpness surrogate. GRPO-SG is then introduced as a token-weighted variant that downweights high-gradient tokens. No quoted equations or steps show the weighting rule or performance claims reducing to a fitted parameter renamed as prediction, a self-definition, or a self-citation chain that bears the load. The derivation remains independent of the target result, with experiments across tasks providing separate support; this is the common case of a self-contained proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a robustness-based generalization perspective that treats gradient norm as a valid sharpness surrogate; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm.
    This view is invoked to justify the need for sharpness control in GRPO training.
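
For orientation, a standard first-order argument from the sharpness-aware training literature shows why a gradient norm can serve as a sharpness surrogate. The symbols below (empirical loss \hat{L}, perturbation \epsilon, radius \rho) are notation assumed here; this is not the paper's Eq. (15) or Eq. (16), which this page does not reproduce.

```latex
\max_{\|\epsilon\|_2 \le \rho} \hat{L}(\theta + \epsilon)
  \;\approx\; \hat{L}(\theta) + \rho \, \|\nabla_\theta \hat{L}(\theta)\|_2
```

Under this approximation, the worst-case loss within a \rho-ball exceeds the empirical loss by roughly \rho times the gradient norm, so reducing the gradient norm reduces that gap. Whether the approximation carries over to the clipped GRPO surrogate is exactly the point raised in the first major comment.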

pith-pipeline@v0.9.0 · 5688 in / 1252 out tokens · 35128 ms · 2026-05-18T03:07:44.282901+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

    cs.CL 2026-02 unverdicted novelty 6.0

    STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.
