{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:BTZPDJO66QJ4W3LPEDJYN5XP7X","short_pith_number":"pith:BTZPDJO6","schema_version":"1.0","canonical_sha256":"0cf2f1a5def413cb6d6f20d386f6effdedcc3264e3e7081e8404e0fc3fbf4847","source":{"kind":"arxiv","id":"2601.05242","version":1},"attestation_state":"computed","paper":{"title":"GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Decoupling normalization of each reward in multi-reward RL prevents collapse of advantage values into identical signals.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Hongxu Yin, Jan Kautz, Kwang-Ting Cheng, Mingjie Liu, Min-Hung Chen, Pavlo Molchanov, Peter Belcak, Shih-Yang Liu, Shizhe Diao, Ximing Lu, Xin Dong, Yejin Choi, Yu-Chiang Frank Wang","submitted_at":"2026-01-08T18:59:24Z","abstract_excerpt":"As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize dis"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2601.05242","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2026-01-08T18:59:24Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"b6bf6385df9e528004dff7db06dadda8378dad300ba2b59eae30c311d9848d4d","abstract_canon_sha256":"d646f2c556ec36788b65f958f77a2db7de5f740eceeddfc230153cfd0e2107c8"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.386738Z","signature_b64":"BiyEUVI43PK47OX4lR80kKhdPf4TatG1Uy6H1frvTXtqfH1j38iJqMoXCKAE3Dj97xcrLGpXLJAL2QaqbCGqCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0cf2f1a5def413cb6d6f20d386f6effdedcc3264e3e7081e8404e0fc3fbf4847","last_reissued_at":"2026-05-17T23:38:53.386098Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.386098Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Decoupling normalization of each reward in multi-reward RL prevents collapse of advantage values into identical signals.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Hongxu Yin, Jan Kautz, Kwang-Ting Cheng, Mingjie Liu, Min-Hung Chen, Pavlo Molchanov, Peter Belcak, Shih-Yang Liu, Shizhe Diao, Ximing Lu, Xin Dong, Yejin Choi, Yu-Chiang Frank Wang","submitted_at":"2026-01-08T18:59:24Z","abstract_excerpt":"As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize dis"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That separately normalizing each reward before aggregation will faithfully preserve relative differences across reward combinations without introducing new scaling artifacts or training instabilities.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Decoupling normalization of each reward in multi-reward RL prevents collapse of advantage values into identical signals.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"75af564ee4c40e203c3617c2e8d40c07bde24a143f864db01334c51452065d93"},"source":{"id":"2601.05242","kind":"arxiv","version":1},"verdict":{"id":"11d7b481-8998-4c0f-b0dd-a6d36bf90dc7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:26:51.583532Z","strongest_claim":"directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure","one_line_summary":"GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That separately normalizing each reward before aggregation will faithfully preserve relative differences across reward combinations without introducing new scaling artifacts or training instabilities.","pith_extraction_headline":"Decoupling normalization of each reward in multi-reward RL prevents collapse of advantage values into identical signals."},"references":{"count":46,"sample":[{"doi":"","year":2025,"title":"Learn to reason efficiently with adaptive length-based reward shaping","work_id":"6baef149-2491-431c-a612-c9fea9b25a16","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","ref_index":2,"cited_arxiv_id":"2501.12599","is_internal_anchor":true},{"doi":"","year":2024,"title":"Rule based rewards for language model safety.Advances in Neural Information Processing Systems, 37:108877–108901, 2024","work_id":"ffeb3f81-2b5a-49b6-a612-0abc709095b6","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Grpo-care: Consistency- aware reinforcement learning for multimodal reasoning, 2025","work_id":"a856fcde-b204-44aa-9e8c-074097e7c58a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models","work_id":"07c85cc5-4086-4abc-823b-6d0f4ff784d0","ref_index":6,"cited_arxiv_id":"2512.02556","is_internal_anchor":true}],"resolved_work":46,"snapshot_sha256":"2ef5763291c9ceb1f71a4f38e1c35b92da8e4531fd7d78c77428f73e4a2bc2e1","internal_anchors":17},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2601.05242","created_at":"2026-05-17T23:38:53.386196+00:00"},{"alias_kind":"arxiv_version","alias_value":"2601.05242v1","created_at":"2026-05-17T23:38:53.386196+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2601.05242","created_at":"2026-05-17T23:38:53.386196+00:00"},{"alias_kind":"pith_short_12","alias_value":"BTZPDJO66QJ4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"BTZPDJO66QJ4W3LP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"BTZPDJO6","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2605.23281","citing_title":"DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04128","citing_title":"JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20654","citing_title":"REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18899","citing_title":"Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20164","citing_title":"Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2511.00066","citing_title":"Sharpness-Guided Group Relative Policy Optimization via Probability Shaping","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21861","citing_title":"Spatiotemporal Continual Learning for Mobile Edge UAV Networks: Mitigating Catastrophic Forgetting","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13803","citing_title":"EvoGround: Self-Evolving Video Agents for Video Temporal Grounding","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13641","citing_title":"Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13467","citing_title":"PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13534","citing_title":"Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10268","citing_title":"MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09806","citing_title":"LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06650","citing_title":"Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05750","citing_title":"RVPO: Risk-Sensitive Alignment via Variance Regularization","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04499","citing_title":"Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22840","citing_title":"AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19355","citing_title":"LASER: Learning Active Sensing for Continuum Field Reconstruction","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18187","citing_title":"Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07276","citing_title":"Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07394","citing_title":"BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07244","citing_title":"Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06159","citing_title":"Target Policy Optimization","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16893","citing_title":"EasyVideoR1: Easier RL for Video Understanding","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00015","citing_title":"TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning","ref_index":39,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X","json":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X.json","graph_json":"https://pith.science/api/pith-number/BTZPDJO66QJ4W3LPEDJYN5XP7X/graph.json","events_json":"https://pith.science/api/pith-number/BTZPDJO66QJ4W3LPEDJYN5XP7X/events.json","paper":"https://pith.science/paper/BTZPDJO6"},"agent_actions":{"view_html":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X","download_json":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X.json","view_paper":"https://pith.science/paper/BTZPDJO6","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2601.05242&json=true","fetch_graph":"https://pith.science/api/pith-number/BTZPDJO66QJ4W3LPEDJYN5XP7X/graph.json","fetch_events":"https://pith.science/api/pith-number/BTZPDJO66QJ4W3LPEDJYN5XP7X/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X/action/timestamp_anchor","attest_storage":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X/action/storage_attestation","attest_author":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X/action/author_attestation","sign_citation":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X/action/citation_signature","submit_replication":"https://pith.science/pith/BTZPDJO66QJ4W3LPEDJYN5XP7X/action/replication_record"}},"created_at":"2026-05-17T23:38:53.386196+00:00","updated_at":"2026-05-17T23:38:53.386196+00:00"}