{"paper":{"title":"Soft Adaptive Policy Optimization","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A smooth temperature-controlled gate replaces hard clipping to stabilize reinforcement learning updates for language models.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"An Yang, Bowen Yu, Chang Gao, Chujie Zheng, Jingren Zhou, Junyang Lin, Kai Dang, Shixuan Liu, Shuai Bai, Xiong-Hui Chen","submitted_at":"2025-11-25T14:25:19Z","abstract_excerpt":"Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance-a phenomenon exacerbated in Mixture-of-Experts models-leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the smooth temperature-controlled gate selectively attenuates only harmful off-policy signals without suppressing useful learning gradients or introducing new instabilities that hard clipping avoided.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SAPO introduces smooth adaptive gating to replace hard clipping in token- and sequence-level policy optimization for more stable LLM reinforcement learning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A smooth temperature-controlled gate replaces hard clipping to stabilize reinforcement learning updates for language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"fb07efa5fb8857581e125b44350b58b7001a59c3802c6c5dd0d2da4c6721c05a"},"source":{"id":"2511.20347","kind":"arxiv","version":2},"verdict":{"id":"0483695a-675c-4f01-bace-4c1b5f8b66dc","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T07:10:00.308877Z","strongest_claim":"Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes.","one_line_summary":"SAPO introduces smooth adaptive gating to replace hard clipping in token- and sequence-level policy optimization for more stable LLM reinforcement learning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the smooth temperature-controlled gate selectively attenuates only harmful off-policy signals without suppressing useful learning gradients or introducing new instabilities that hard clipping avoided.","pith_extraction_headline":"A smooth temperature-controlled gate replaces hard clipping to stabilize reinforcement learning updates for language models."},"references":{"count":12,"sample":[{"doi":"","year":2025,"title":"Aime problems and solutions","work_id":"e9a59051-7b00-4ffc-902e-f0059e82fd2a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"The sufficiency of off-policyness and soft clipping: Ppo is still insufficient according to an off-policy measure","work_id":"0993a20c-92b1-4188-991b-f800b41bc179","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":3,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2025,"title":"HMMT . Hmmt 2025. https://www.hmmt.org, 2025","work_id":"abbec9a8-4f45-42e0-bdfa-25286b96179f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","ref_index":5,"cited_arxiv_id":"2403.07974","is_internal_anchor":true}],"resolved_work":12,"snapshot_sha256":"a08791fa11792a44147a60fbf2d95bf39d519569ca1e5fbfb604bfc4f91c6de6","internal_anchors":5},"formal_canon":{"evidence_count":2,"snapshot_sha256":"51bfdc33d4146af665e0611bc45d8f22658a7e91fff3d34f4945495e0036f036"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}