{"paper":{"title":"Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RDPO stabilizes advantages in mixed-reward reinforcement learning by normalizing magnitudes and removing correlations before aggregation.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"Jiahong Zhou, Jingang Wang, Kaiyuan Liu, Rongxiang Weng, Xin Chen, Xunliang Cai, Yang Bai, Ziyuan Zhuang","submitted_at":"2026-05-13T15:05:18Z","abstract_excerpt":"Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis wh"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That magnitude-aware quantile normalization and Mahalanobis whitening will stabilize advantages across heterogeneous rewards without discarding critical signal or introducing new biases in the specific reward distributions of the LongCat-Flash training setup.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"RDPO stabilizes advantages in mixed-reward reinforcement learning by normalizing magnitudes and removing correlations before aggregation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"972ae1a76c20a2b7956de84521d7621a62b7068bf1f85c8e45fa9f265b7730a4"},"source":{"id":"2605.13641","kind":"arxiv","version":1},"verdict":{"id":"31a9043e-c6f2-4445-b53d-1950b87a7d90","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T19:20:34.498141Z","strongest_claim":"When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.","one_line_summary":"RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That magnitude-aware quantile normalization and Mahalanobis whitening will stabilize advantages across heterogeneous rewards without discarding critical signal or introducing new biases in the specific reward distributions of the LongCat-Flash training setup.","pith_extraction_headline":"RDPO stabilizes advantages in mixed-reward reinforcement learning by normalizing magnitudes and removing correlations before aggregation."},"references":{"count":17,"sample":[{"doi":"","year":null,"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","ref_index":1,"cited_arxiv_id":"2402.03300","is_internal_anchor":true},{"doi":"10.48550/arxiv.2402.03300","year":null,"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","ref_index":2,"cited_arxiv_id":"2402.03300","is_internal_anchor":true},{"doi":"10.48550/arxiv.2601.05242","year":null,"title":"GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization","work_id":"e60e0bf8-33b8-47c6-8d66-7975f5c121d4","ref_index":3,"cited_arxiv_id":"2601.05242","is_internal_anchor":true},{"doi":"10.48550/arxiv.2501.12599","year":null,"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","ref_index":4,"cited_arxiv_id":"2501.12599","is_internal_anchor":true},{"doi":"10.48550/arxiv.2503.04697","year":null,"title":"L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning","work_id":"ad7236fb-3752-48a9-b782-a384899c45a0","ref_index":5,"cited_arxiv_id":"2503.04697","is_internal_anchor":true}],"resolved_work":17,"snapshot_sha256":"1e5b013d5cda89c960531d62204d94226a0666c2a6d5024d24bb00b4b409fcbf","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}