Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
Pith reviewed 2026-05-18 03:07 UTC · model grok-4.3
The pith
GRPO-SG downweights tokens likely to produce large gradients to stabilize RLVR training and improve generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRPO-SG augments the standard GRPO objective with token-specific weights derived from gradient information, thereby reducing the sharpness surrogate and tightening the generalization bound in the RLVR regime.
What carries the argument
Token-weighting scheme that downweights high-gradient tokens via probability shaping to control update sharpness.
Load-bearing premise
That downweighting tokens with large gradients will reliably shrink the sharpness surrogate and thereby improve the generalization bound in RLVR for language models.
What would settle it
An experiment that measures whether GRPO-SG actually lowers gradient-norm peaks while producing higher accuracy on held-out reasoning problems than standard GRPO.
Original abstract
Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
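The abstract states the bound only in words. As a reading aid, one schematic way to write it is given here, assuming a perturbation radius \rho and a first-order Taylor expansion of the empirical loss \hat{\mathcal{L}}. The symbols, the radius, and the omission of any complexity term are assumptions of this sketch, not the paper's statement.

```latex
% Schematic only: robustness-style bound, then first-order expansion.
% \rho, \hat{\mathcal{L}}, and \mathcal{L}_{\mathrm{gen}} are illustrative notation.
\mathcal{L}_{\mathrm{gen}}(\theta)
  \;\lesssim\; \max_{\|\epsilon\|_2 \le \rho} \hat{\mathcal{L}}(\theta + \epsilon)
  \;\approx\; \hat{\mathcal{L}}(\theta)
  + \rho \,\bigl\|\nabla_\theta \hat{\mathcal{L}}(\theta)\bigr\|_2
```

The second step follows because the linearized loss \hat{\mathcal{L}}(\theta) + \epsilon^\top \nabla_\theta \hat{\mathcal{L}}(\theta) is maximized over the ball by \epsilon aligned with the gradient, which is why the gradient norm can act as a sharpness surrogate.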
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Sharpness-Guided Group Relative Policy Optimization (GRPO-SG), a token-weighted variant of Group Relative Policy Optimization (GRPO) for reinforcement learning with verifiable rewards (RLVR). It motivates the approach via a robustness-based generalization perspective in which generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, then introduces probability shaping to downweight tokens that produce large gradients. Experiments on mathematical reasoning, logic puzzles, and tool-augmented question answering report consistent gains over GRPO together with smoother gradient-norm trajectories.
Significance. If the claimed generalization bound and the empirical improvements hold, GRPO-SG supplies a lightweight, practical upgrade to a widely used RLVR optimizer that directly targets sharpness to stabilize training and reduce the generalization gap. The method’s simplicity and the reported gradient-norm smoothing are clear strengths, and the method could be adopted with minimal implementation cost.
major comments (2)
- [Introduction] Introduction / generalization view: The central motivation states that generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, yet no derivation or adaptation of this bound is supplied for the GRPO objective (clipped surrogate with group-relative advantages and non-differentiable verifiable rewards). Standard sharpness bounds assume Lipschitz or smoothness conditions that do not automatically transfer to this setting; the missing link is load-bearing for the token-downweighting rationale.
- [Method] Method section (probability shaping): The token-weighting rule is described as downweighting high-gradient tokens, but the precise functional form, normalization, and whether the resulting estimator remains unbiased with respect to the original GRPO advantage estimates are not shown. This detail is required to confirm that the modification does not alter the core policy-gradient properties.
minor comments (2)
- Notation for the weighting factor and the gradient-norm surrogate should be introduced with explicit equations rather than by prose description alone.
- [Experiments] Gradient-norm trajectory plots would be clearer if they included shaded standard-deviation bands across runs.
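The request for shaded bands in the last comment can be sketched as follows. This is a minimal illustration, assuming gradient norms were logged at the same steps for every run; the function names `gradnorm_band` and `peak_gradnorm` are hypothetical and not taken from the paper.

```python
import numpy as np


def gradnorm_band(runs):
    """Mean and +/- one standard deviation band across runs.

    runs: array-like of shape (n_runs, n_steps) holding logged gradient norms.
    Returns (mean, lower, upper), each of shape (n_steps,).
    """
    x = np.asarray(runs, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)  # population std across runs at each step
    return mean, mean - std, mean + std


def peak_gradnorm(runs):
    """Largest gradient norm reached in each run, shape (n_runs,)."""
    return np.asarray(runs, dtype=float).max(axis=1)


if __name__ == "__main__":
    # Toy numbers for illustration only, not results from the paper.
    toy_runs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
    mean, lower, upper = gradnorm_band(toy_runs)
    print(mean, lower, upper, peak_gradnorm(toy_runs))
```

If matplotlib is the plotting tool (an assumption), the band can be drawn with `plt.fill_between(steps, lower, upper, alpha=0.3)` under the mean curve. The per-run peaks give a simple number for comparing GRPO and GRPO-SG trajectories.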
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications where possible and indicating the revisions we will make to strengthen the paper.
-
Referee: [Introduction] Introduction / generalization view: The central motivation states that generalization loss is upper-bounded by empirical loss plus a gradient-norm sharpness surrogate, yet no derivation or adaptation of this bound is supplied for the GRPO objective (clipped surrogate with group-relative advantages and non-differentiable verifiable rewards). Standard sharpness bounds assume Lipschitz or smoothness conditions that do not automatically transfer to this setting; the missing link is load-bearing for the token-downweighting rationale.
Authors: We acknowledge that the manuscript motivates the approach via a robustness-based generalization perspective but does not supply an explicit derivation or adaptation of the bound tailored to the GRPO objective, including its clipped surrogate, group-relative advantages, and non-differentiable verifiable rewards. The bound is presented as a guiding view drawn from the literature on sharpness and generalization rather than a new theorem. We agree this link could be made more precise. In the revision we will add a short paragraph in the introduction sketching the adaptation: the gradient norm is computed on the differentiable policy component (log-probabilities), while the verifiable rewards enter only through the advantage estimates; the token downweighting is intended to reduce the sharpness surrogate term heuristically. We will also note the limitations of standard Lipschitz assumptions in this setting and frame the motivation accordingly. revision: yes
-
Referee: [Method] Method section (probability shaping): The token-weighting rule is described as downweighting high-gradient tokens, but the precise functional form, normalization, and whether the resulting estimator remains unbiased with respect to the original GRPO advantage estimates are not shown. This detail is required to confirm that the modification does not alter the core policy-gradient properties.
Authors: We thank the referee for highlighting this omission. The current manuscript describes the idea at a high level but does not provide the exact functional form or normalization. In the revised version we will insert the precise definition: within each group the token weight is w_i = softmax(-β · ||∇_θ log π_θ(o_i | q)||), where β controls the downweighting strength, followed by normalization so that weights sum to one per group. The weighted terms are then multiplied into the per-token contributions of the GRPO loss. Because the weights are computed from the current policy’s gradients (independent of the sampled advantages) and the underlying sampling distribution is unchanged, the estimator remains unbiased for the corresponding weighted policy gradient. We will include the formula, a brief unbiasedness argument, and a short discussion of how the modification preserves the core properties of the original GRPO estimator. revision: yes
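The weighting rule stated in the response above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: it assumes one precomputed gradient-norm score per item in a group (whether an item is a token or a whole response is left open, as in the text), uses the common GRPO convention of standardizing rewards within a group, and leaves out clipping and any KL term. The names `group_relative_advantages`, `sharpness_weights`, and `weighted_surrogate` are hypothetical. Because a softmax already sums to one, the sketch applies no separate normalization step.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize verifiable rewards within one group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def sharpness_weights(grad_norms, beta=1.0):
    """w_i = softmax(-beta * g_i) over the group.

    Items with larger gradient norms g_i receive smaller weights;
    beta = 0 recovers uniform weights.
    """
    g = np.asarray(grad_norms, dtype=float)
    logits = -beta * g
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def weighted_surrogate(log_probs, advantages, weights):
    """Weighted policy-gradient surrogate loss (no clipping, no KL term)."""
    lp = np.asarray(log_probs, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(-np.sum(w * adv * lp))


if __name__ == "__main__":
    # Toy values for illustration only.
    rewards = [1.0, -1.0, 1.0, 1.0]
    grad_norms = [0.5, 1.0, 4.0, 0.8]
    log_probs = [-0.2, -1.5, -3.0, -0.4]
    adv = group_relative_advantages(rewards)
    w = sharpness_weights(grad_norms, beta=1.0)
    print(w, weighted_surrogate(log_probs, adv, w))
```

In a real training loop the weights would typically be treated as constants (no gradient flowing through them), which matches the description of weights computed from the current policy and then multiplied into the per-token loss terms.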
Circularity Check
No circularity: generalization view motivates proposal without reducing to inputs by construction
full rationale
The paper states a robustness-based generalization perspective as motivation, with the bound presented as an upper bound on generalization loss via empirical loss plus gradient-norm sharpness surrogate. GRPO-SG is then introduced as a token-weighted variant that downweights high-gradient tokens. No quoted equations or steps show the weighting rule or performance claims reducing to a fitted parameter renamed as prediction, a self-definition, or a self-citation chain that bears the load. The derivation remains independent of the target result, with experiments across tasks providing separate support; this is the common case of a self-contained proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm.
Forward citations
Cited by 1 Pith paper
-
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.
Reference graph
Works this paper leans on
-
[1]
Sharp-maml: Sharpness-aware model-agnostic meta learning
Abbas, M., Xiao, Q., Chen, L., Chen, P.-Y., and Chen, T. Sharp-maml: Sharpness-aware model-agnostic meta learning. arXiv preprint arXiv:2206.03996,
-
[2]
Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.508. URL https://aclanthology.org/2022.acl-long.508. Cha, J., Chun, S., Lee, K., Cho, H.-C., Park, S., Lee, Y., and Park, S. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418,
-
[4]
When vision transformers outperform resnets without pre-training or strong data augmentations
URL https://arxiv.org/abs/2512.22255. Chen, X., Hsieh, C.-J., and Gong, B. When vision transformers outperform resnets without pre-training or strong data augmentations. arXiv preprint arXiv:2106.01548,
-
[5]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
-
[6]
KTO: Model Alignment as Prospect Theoretic Optimization
Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306,
-
[7]
URL https://openreview.net/forum?id=6Tm1mposlrM. Gao, J., Xu, S., Ye, W., Liu, W., He, C., Fu, W., Mei, Z., Wang, G., and Wu, Y. On designing effective rl reward at training time for llm reasoning. arXiv preprint arXiv:2410.15115,
-
[8]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Guan, X., Zhang, L. L., Liu, Y., Shang, N., Sun, Y., Zhu, Y., Yang, F., and Yang, M. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519,
-
[9]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
-
[10]
He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008,
-
[12]
Measuring Mathematical Problem Solving With the MATH Dataset
URL https://arxiv.org/abs/2103.03874, 2,
-
[13]
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
Ho, X., Nguyen, A.-K. D., Sugawara, S., and Aizawa, A. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060,
-
[14]
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290,
-
[15]
Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,
-
[16]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,
-
[17]
Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. J. Three factors influencing minima in sgd. ArXiv, abs/1711.04623,
-
[18]
Jiang, D., Lu, Y., Li, Z., Lyu, Z., Nie, P., Wang, H., Su, A., Chen, H., Zou, K., Du, C., et al. Verltool: Towards holistic agentic reinforcement learning with tool use. arXiv preprint arXiv:2509.01055,
-
[19]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516,
-
[20]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,
-
[21]
Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124,
-
[22]
Liu, A., Bai, H., Lu, Z., Sun, Y., Kong, X., Wang, S., Shan, J., Jose, A. M., Liu, X., Wen, L., et al. Tis-dpo: Token-level importance sampling for direct preference optimization with estimated weights. arXiv preprint arXiv:2410.04350,
-
[23]
Flow-GRPO: Training Flow Matching Models via Online RL
Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470,
-
[24]
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
URL https://arxiv.org/abs/2601.05242. Ma, X., Liu, Q., Jiang, D., Zhang, G., Ma, Z., and Chen, W. General-reasoner: Advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652,
-
[25]
Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., and Hajishirzi, H. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511,
-
[26]
Measuring and Narrowing the Compositionality Gap in Language Models
Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350,
-
[27]
Generalized federated learning via sharpness aware minimization
Qu, Z., Li, X., Duan, R., Liu, Y., Tang, B., and Lu, Z. Generalized federated learning via sharpness aware minimization. arXiv preprint arXiv:2206.02618,
-
[28]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[29]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[30]
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Song, H., Jiang, J., Min, Y ., Chen, J., Chen, Z., Zhao, W. X., Fang, L., and Wen, J.-R. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592,
-
[31]
Understanding the performance gap between online and offline alignment algorithms
Tang, Y., Guo, D. Z., Zheng, Z., Calandriello, D., Cao, Y., Tarassov, E., Munos, R., Pires, B. Á., Valko, M., Cheng, Y., et al. Understanding the performance gap between online and offline alignment algorithms. arXiv preprint arXiv:2405.08448,
-
[32]
Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599,
-
[33]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533,
-
[34]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741,
-
[35]
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Wang, Y., Yang, Q., Zeng, Z., Ren, L., Liu, L., Peng, B., Cheng, H., He, X., Wang, K., Gao, J., et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571,
-
[36]
On memorization of large language models in logical reasoning
Xie, C., Huang, Y., Zhang, C., Yu, D., Chen, X., Lin, B. Y., Li, B., Ghazi, B., and Kumar, R. On memorization of large language models in logical reasoning. arXiv preprint arXiv:2410.23123,
-
[37]
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Xie, T., Gao, Z., Ren, Q., Luo, H., Hong, Y., Dai, B., Zhou, J., Qiu, K., Wu, Z., and Luo, C. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768,
-
[38]
DanceGRPO: Unleashing GRPO on Visual Generation
Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818,
-
[39]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., et al. Qwen2. 5-...
-
[40]
Yang, Z., Luo, X., Wang, Z., Han, D., He, Z., Li, D., and Xu, Y. Do not let low-probability tokens over-dominate in rl for llms. arXiv preprint arXiv:2505.12929, 2025c. Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:25...
-
[41]
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Yue, Y., Yuan, Y., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., Wang, C., Fan, T., Du, Z., et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118,
-
[42]
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., and He, J. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,
-
[43]
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Zhao, A., Wu, Y., Yue, Y., Wu, T., Xu, Q., Lin, M., Wang, S., Wu, Q., Zheng, Z., and Huang, G. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335,
-
[45]
Surrogate gap minimization improves sharpness aware training
URL https://arxiv.org/abs/2512.01374. Zhuang, J., Gong, B., Yuan, L., Cui, Y., Adam, H., Dvornek, N. C., Tatikonda, S., Duncan, J. S., and Liu, T. Surrogate gap minimization improves sharpness aware training. In International Conference on Learning Representations (ICLR),
-
[46]
URL https://arxiv.org/abs/2203.08065. A. Related Work. Large-Scale Reasoning Models. Large language models (LLMs) (Lambert et al., 2024; Gao et al., 2024; Team et al., 2025; Guo et al., 2025; Yang et al., 2025a) have recently made substantial advances across a wide range of NLP ...
-
[47]
scales reinforcement learning to train models that solve challenging reasoning problems and achieves state-of-the-art results on several benchmarks. Reinforcement Learning for Large Language Model. Before reasoning-centric systems such as OpenAI’s O-series (Jaech et al., 2024), reinforcement learning (RL) was most commonly used through reinforcement learni...
-
[48]
provided an early demonstration that RL can scale reasoning ability, and later systems such as DeepSeek-R1 (Guo et al., 2025), Kimi-2 (Team et al., 2025), and Qwen3 (Yang et al., 2025a) have matched or exceeded its performance. In particular, DeepSeek-R1 emphasizes that strong reasoning can arise from outcome-based online RL, notably with GRPO (Shao et al...
-
[49]
is a more recent optimization approach that targets improved generalization by explicitly accounting for loss-landscape sharpness during training. In particular, SAM optimizes the worst-case loss within a neighborhood of the current parameters, which encourages updates toward flatter regions while maintaining low training loss and better performance on un...
-
[50]
which incorporates a FAISS-based retrieval module, allowing agents to query a local knowledge base and extract the most relevant evidence for answering complex questions. Building on prior work (Jin et al., 2025; Song et al., 2025), an E5 retriever (Wang et al.,
-
[51]
was employed with the 2018 Wikipedia dump (Karpukhin et al.,
-
[52]
as the indexed corpus. The agent alternates between retrieval operations and reasoning steps to form complete answers. We adopt Qwen2.5-3B (Yang et al., 2025a) and Qwen3-4B-Instruct-2507 (Yang et al., 2025a) as the base models. For this task, we use accuracy as the main reward, defined as: $R_{\text{search}}(x, y) = \begin{cases} 1 & \text{if } \mathrm{match}(y, y_g) \\ -1 & \text{otherwise} \end{cases}$ (20) For evaluation,...
-
[53]
Table 8 shows that GRPO-SG consistently improves over GRPO across all tested backbones, indicating that our method generalizes beyond Qwen-Instruct models. This provides further evidence that the proposed token-weighted strategy is broadly applicable and not limited to a single model family. ...