arxiv: 2511.00066 · v4 · pith:SPDDBVRNnew · submitted 2025-10-29 · 💻 cs.LG

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Tue Le, Linh Ngo Van, Trung Le

Pith reviewed 2026-05-18 03:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords GRPO, RLVR, generalization, sharpness, token weighting, policy optimization, LLM reasoning, reinforcement learning

The pith

GRPO-SG downweights tokens likely to produce large gradients to stabilize RLVR training and improve generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines Group Relative Policy Optimization (GRPO) in the setting of reinforcement learning with verifiable rewards for large language models. It adopts a robustness perspective in which generalization loss is upper-bounded by a combination of empirical loss and a sharpness term measured by gradient norm. Building on this view, the authors introduce GRPO-SG, a token-weighted variant that downweights tokens expected to trigger overly large gradients. The weighting shapes the policy update to avoid sharp changes, producing smoother gradient trajectories and higher performance on mathematical reasoning, logic puzzles, and tool-augmented question answering. The method is offered as a lightweight modification that can be layered on existing GRPO implementations.

Core claim

GRPO-SG augments the standard GRPO objective with token-specific weights derived from gradient information, thereby reducing the sharpness surrogate and tightening the generalization bound in the RLVR regime.
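
As a schematic reading of this claim, the link between a worst-case perturbed loss and the gradient norm can be written as below. This is not the paper's Eq. (15) or Eq. (16), which are not reproduced on this page. The symbols $\hat{L}$ (empirical loss), $L_{\mathrm{gen}}$ (generalization loss), $\rho$ (perturbation radius), and $C$ (a remainder term) are assumptions introduced for illustration.

```latex
% First-order expansion of the worst-case loss in a radius-\rho ball (standard Taylor argument)
\max_{\|\epsilon\|_2 \le \rho} \hat{L}(\theta + \epsilon)
  \approx \hat{L}(\theta) + \rho \, \|\nabla_\theta \hat{L}(\theta)\|_2

% Schematic shape of the bound described in the text (assumed form, not the paper's statement)
L_{\mathrm{gen}}(\theta) \le \hat{L}(\theta) + \rho \, \|\nabla_\theta \hat{L}(\theta)\|_2 + C
```

Under this reading, any change that lowers the gradient-norm term without raising the empirical loss tightens the right-hand side, which is the role the token weights are meant to play.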

What carries the argument

Token-weighting scheme that downweights high-gradient tokens via probability shaping to control update sharpness.
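
A minimal sketch of where per-token weights could enter a GRPO-style clipped surrogate, assuming a single response with one shared group-relative advantage. The function names, the `weights` argument, and the `clip_eps=0.2` default are hypothetical; how GRPO-SG actually derives its weights is not specified on this page, so they are taken as an input here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def weighted_clipped_surrogate(logp_new, logp_old, advantage, weights, clip_eps=0.2):
    """Token-weighted clipped surrogate (to be maximized) for one response.

    logp_new, logp_old: per-token log-probabilities, shape (T,)
    advantage: scalar advantage shared by all tokens of the response
    weights: hypothetical per-token weights, shape (T,)
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    per_token = np.minimum(unclipped, clipped)   # pessimistic PPO-style term
    return float(np.mean(np.asarray(weights, dtype=float) * per_token))
```

With all weights equal to one this reduces to the unweighted per-token average, so the weighting can be layered on an existing loss without changing its other terms.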

Load-bearing premise

That downweighting tokens with large gradients will reliably shrink the sharpness surrogate and thereby improve the generalization bound in RLVR for language models.

What would settle it

An experiment that measures whether GRPO-SG actually lowers gradient-norm peaks while producing higher accuracy on held-out reasoning problems than standard GRPO.
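
The gradient-norm half of such an experiment could be summarized with a few simple statistics per training run. This is a sketch under stated assumptions: `grad_norm_summary` is a hypothetical helper, and counting a "spike" as any step whose norm exceeds `spike_factor` times the run's median is an illustrative choice, not a definition from the paper.

```python
import numpy as np

def grad_norm_summary(trace, spike_factor=2.0):
    """Summarize one run's gradient-norm trajectory.

    trace: sequence of per-step gradient norms
    spike_factor: a step is counted as a spike if its norm exceeds
                  spike_factor * median(trace) (assumed threshold rule)
    """
    t = np.asarray(trace, dtype=float)
    median = float(np.median(t))
    return {
        "peak": float(t.max()),
        "std": float(t.std()),
        "num_spikes": int(np.sum(t > spike_factor * median)),
    }
```

Comparing these summaries for GRPO and GRPO-SG runs on the same data would cover the gradient-norm side; held-out accuracy still has to be measured separately.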

Figures

Figures reproduced from arXiv: 2511.00066 by Linh Ngo Van, Trung Le, Tue Le.

**Figure 1.** Gradient norm trajectories during training under GRPO vs. GRPO-SG across three RLVR settings. GRPO-SG consistently exhibits lower variability and fewer spikes than GRPO, consistent with reduced sharpness in Eq. (15) and the bound in Eq. (16).

**Figure 2.** Training reward trajectories under GRPO vs. GRPO-SG across three RLVR settings. GRPO-SG achieves higher reward while also exhibiting lower sharpness, as reflected by gradient norms.

**Figure 2.** GRPO-SG yields higher and more stable reward.

**Figure 3.** Word clouds of the top 100 high- vs. low-probability tokens selected from frequently occurring words. High-probability tokens (left) primarily consist of mathematical and logical operators, brackets, and variable names, where even small errors can invalidate an entire solution, whereas low-probability tokens (right) mostly consist of generic content words that are less critical.

**Figure 4.** Accuracy on the K&K Logic Puzzles benchmark, broken down by puzzle size (3–7 people). GRPO-SG consistently achieves higher accuracy than GRPO across all difficulty levels, while the Reverse variant yields performance comparable to GRPO without clear improvement.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRPO-SG adds a gradient-norm token weighting to GRPO with some empirical gains on reasoning tasks, but the claimed generalization bound is not derived for the GRPO objective.

GRPO-SG is a token-weighted variant of GRPO that downweights high-gradient tokens to reduce sharpness and improve generalization in RLVR training. The paper does a solid job on the practical side. It applies this weighting in experiments on math reasoning, logic puzzles, and tool-augmented QA, showing consistent gains over standard GRPO along with smoother gradient norms. This kind of simple modification could be useful for anyone already running GRPO setups who wants a bit more stability without extra data or model changes.

What is new is the specific use of gradient-norm-based probability shaping in the GRPO context for verifiable-reward RL. It builds on sharpness ideas but tailors them to the token level in this optimizer.

The main soft spot is the theoretical motivation. The generalization view assumes the loss is upper bounded by empirical loss plus a gradient-norm sharpness term, but there's no derivation showing this holds for GRPO's clipped surrogate objective with group advantages. That makes the connection between the weighting and better generalization an assumption rather than a proven step. The experiments support the method empirically, but without more ablations or variance reporting, it's not clear how robust the gains are.

This paper is for people doing RL fine-tuning on LLMs for reasoning. A reader interested in small tweaks to existing methods would find it worth a look. It has enough empirical substance to deserve a serious referee, who could push on the bound and ask for more controls. I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Sharpness-Guided Group Relative Policy Optimization (GRPO-SG), a token-weighted variant of Group Relative Policy Optimization (GRPO) for reinforcement learning with verifiable rewards (RLVR). It motivates the approach via a robustness-based generalization perspective in which generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, then introduces probability shaping to downweight tokens that produce large gradients. Experiments on mathematical reasoning, logic puzzles, and tool-augmented question answering report consistent gains over GRPO together with smoother gradient-norm trajectories.

Significance. If the claimed generalization bound and the empirical improvements hold, GRPO-SG supplies a lightweight, practical upgrade to a widely used RLVR optimizer that directly targets sharpness to stabilize training and reduce generalization gap. The method’s simplicity and the reported gradient-norm smoothing are clear strengths that could be adopted with minimal implementation cost.

major comments (2)

[Introduction] Introduction / generalization view: The central motivation states that generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, yet no derivation or adaptation of this bound is supplied for the GRPO objective (clipped surrogate with group-relative advantages and non-differentiable verifiable rewards). Standard sharpness bounds assume Lipschitz or smoothness conditions that do not automatically transfer to this setting; the missing link is load-bearing for the token-downweighting rationale.
[Method] Method section (probability shaping): The token-weighting rule is described as downweighting high-gradient tokens, but the precise functional form, normalization, and whether the resulting estimator remains unbiased with respect to the original GRPO advantage estimates are not shown. This detail is required to confirm that the modification does not alter the core policy-gradient properties.

minor comments (2)

Notation for the weighting factor and the gradient-norm surrogate should be introduced with an explicit equation rather than prose description only.
[Experiments] Gradient-norm trajectory plots would be clearer if they included shaded standard-deviation bands across runs.
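
The shaded-band suggestion in the last comment could be implemented along these lines. This is a sketch, not the paper's plotting code: `mean_and_band` is a hypothetical helper, and using the sample standard deviation (`ddof=1`) across runs is an assumption.

```python
import numpy as np

def mean_and_band(runs):
    """Per-step mean and one-standard-deviation band across runs.

    runs: array-like of shape (num_runs, num_steps), e.g. gradient norms
          logged at the same steps for several random seeds
    Returns (mean, lower, upper), each of shape (num_steps,).
    """
    r = np.asarray(runs, dtype=float)
    mean = r.mean(axis=0)
    # Sample std across runs; with a single run the band collapses to the mean.
    std = r.std(axis=0, ddof=1) if r.shape[0] > 1 else np.zeros_like(mean)
    return mean, mean - std, mean + std
```

The returned `lower` and `upper` arrays can be passed to Matplotlib's `Axes.fill_between` with a low `alpha` to draw the shaded region behind the mean curve.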

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications where possible and indicating the revisions we will make to strengthen the paper.

Referee: [Introduction] Introduction / generalization view: The central motivation states that generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, yet no derivation or adaptation of this bound is supplied for the GRPO objective (clipped surrogate with group-relative advantages and non-differentiable verifiable rewards). Standard sharpness bounds assume Lipschitz or smoothness conditions that do not automatically transfer to this setting; the missing link is load-bearing for the token-downweighting rationale.

Authors: We acknowledge that the manuscript motivates the approach via a robustness-based generalization perspective but does not supply an explicit derivation or adaptation of the bound tailored to the GRPO objective, including its clipped surrogate, group-relative advantages, and non-differentiable verifiable rewards. The bound is presented as a guiding view drawn from the literature on sharpness and generalization rather than a new theorem. We agree this link could be made more precise. In the revision we will add a short paragraph in the introduction sketching the adaptation: the gradient norm is computed on the differentiable policy component (log-probabilities), while the verifiable rewards enter only through the advantage estimates; the token downweighting is intended to reduce the sharpness surrogate term heuristically. We will also note the limitations of standard Lipschitz assumptions in this setting and frame the motivation accordingly. revision: yes
Referee: [Method] Method section (probability shaping): The token-weighting rule is described as downweighting high-gradient tokens, but the precise functional form, normalization, and whether the resulting estimator remains unbiased with respect to the original GRPO advantage estimates are not shown. This detail is required to confirm that the modification does not alter the core policy-gradient properties.

Authors: We thank the referee for highlighting this omission. The current manuscript describes the idea at a high level but does not provide the exact functional form or normalization. In the revised version we will insert the precise definition: within each group the token weight is w_i = softmax(-β · ||∇_θ log π_θ(o_i | q)||), where β controls the downweighting strength, followed by normalization so that weights sum to one per group. The weighted terms are then multiplied into the per-token contributions of the GRPO loss. Because the weights are computed from the current policy’s gradients (independent of the sampled advantages) and the underlying sampling distribution is unchanged, the estimator remains unbiased for the corresponding weighted policy gradient. We will include the formula, a brief unbiasedness argument, and a short discussion of how the modification preserves the core properties of the original GRPO estimator. revision: yes
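
The weighting rule quoted in the response above can be sketched as follows. It illustrates that quoted formula only and is not a confirmed implementation from the paper. The function names are hypothetical, and the gradient norm with respect to the logits is used as a cheap stand-in for the full parameter-gradient norm ||∇_θ log π_θ||, which is an assumption.

```python
import numpy as np

def logit_grad_norm(probs, token_id):
    """Norm of d log p[token_id] / d logits for a softmax policy.

    For a softmax, this gradient equals one_hot(token_id) - probs.
    Assumption: used here as a proxy for the parameter-gradient norm.
    """
    p = np.asarray(probs, dtype=float)
    grad = -p.copy()
    grad[token_id] += 1.0
    return float(np.linalg.norm(grad))

def softmax_downweights(grad_norms, beta=1.0):
    """w_i = softmax(-beta * g_i): larger gradient norm, smaller weight."""
    z = -beta * np.asarray(grad_norms, dtype=float)
    z = z - z.max()      # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()   # softmax output already sums to one

# Toy case: three candidate tokens under a 3-way softmax policy.
probs = [0.7, 0.2, 0.1]
norms = [logit_grad_norm(probs, k) for k in range(3)]
weights = softmax_downweights(norms, beta=2.0)
```

In this toy case the lowest-probability token has the largest logit-gradient norm and receives the smallest weight, which matches the direction of the downweighting described in the response. Because a softmax already sums to one, the separate normalization step mentioned there is a no-op in this sketch.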

Circularity Check

0 steps flagged

No circularity: generalization view motivates proposal without reducing to inputs by construction

The paper states a robustness-based generalization perspective as motivation, with the bound presented as an upper bound on generalization loss via empirical loss plus gradient-norm sharpness surrogate. GRPO-SG is then introduced as a token-weighted variant that downweights high-gradient tokens. No quoted equations or steps show the weighting rule or performance claims reducing to a fitted parameter renamed as prediction, a self-definition, or a self-citation chain that bears the load. The derivation remains independent of the target result, with experiments across tasks providing separate support; this is the common case of a self-contained proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a robustness-based generalization perspective that treats gradient norm as a valid sharpness surrogate; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm.
This view is invoked to justify the need for sharpness control in GRPO training.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
cs.CL 2026-02 unverdicted novelty 6.0

STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.
