{"paper":{"title":"CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A Coach generates tasks and rewards a Player for solving them, improving LLM math reasoning without any external data.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Bingxiang He, Jiarui Yuan, Jinyi Hu, Maosong Sun, Ran Li, Weize Chen, Yinghao Chen, Zeyuan Liu, Zhiyuan Liu, Zixuan Fu","submitted_at":"2026-02-03T01:38:53Z","abstract_excerpt":"Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement lea"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"CPMobius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. 
For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The cooperative optimization loop between Coach and Player directly enhances the Player's mathematical reasoning ability without external data or labels.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CPMobius uses iterative coach-player reinforcement learning to improve mathematical reasoning in LLMs without external training data, yielding +4.9 average accuracy gains on Qwen2.5-Math-7B-Instruct.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A Coach generates tasks and rewards a Player for solving them, improving LLM math reasoning without any external data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"af00d8d37100c7ad6926b05a420da03b492af55976d775a2a8bc476119afbf0f"},"source":{"id":"2602.02979","kind":"arxiv","version":2},"verdict":{"id":"06ab818a-5768-49d1-a65f-26930b8497bf","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:39:31.999161Z","strongest_claim":"CPMobius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. 
For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4.","one_line_summary":"CPMobius uses iterative coach-player reinforcement learning to improve mathematical reasoning in LLMs without external training data, yielding +4.9 average accuracy gains on Qwen2.5-Math-7B-Instruct.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The cooperative optimization loop between Coach and Player directly enhances the Player's mathematical reasoning ability without external data or labels.","pith_extraction_headline":"A Coach generates tasks and rewards a Player for solving them, improving LLM math reasoning without any external data."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"6fb6c2b6460be6fc0ec4621098faece13313602a8189bbffeb68cb065a57e60a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}