{"paper":{"title":"Sharpness-Guided Group Relative Policy Optimization via Probability Shaping","license":"http://creativecommons.org/licenses/by/4.0/","headline":"GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Linh Ngo Van, Trung Le, Tue Le","submitted_at":"2025-10-29T08:07:47Z","abstract_excerpt":"Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-"},"claims":{"count":3,"items":[{"kind":"strongest_claim","text":"GRPO-SG, a simple token-weighted variant of GRPO, downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization in RLVR.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm, and that downweighting high-gradient tokens will reliably reduce this sharpness in the RLVR setting for LLMs.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"}],"snapshot_sha256":"9b90ce15ea5ceca22a10db72cd910009dc7570f1d5ea55c60b134457067c5ef3"},"source":{"id":"2511.00066","kind":"arxiv","version":4},"verdict":{"id":"90ca2c5f-5ba7-4e36-81e5-c4d09186bcfa","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-18T03:07:44.282901Z","strongest_claim":"GRPO-SG, a simple token-weighted variant of GRPO, downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization in RLVR.","one_line_summary":"GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm, and that downweighting high-gradient tokens will reliably reduce this sharpness in the RLVR setting for LLMs.","pith_extraction_headline":""},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}