CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

Bingxiang He; Jiarui Yuan; Jinyi Hu; Maosong Sun; Ran Li; Weize Chen; Yinghao Chen; Zeyuan Liu; Zhiyuan Liu; Zixuan Fu

arxiv: 2602.02979 · v2 · pith:7RRY4RV6new · submitted 2026-02-03 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-16 08:39 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords data-free reinforcement learning · coach-player collaboration · mathematical reasoning · LLM self-improvement · unsupervised training · iterative task generation · reasoning model scaling

The pith

A Coach generates tasks and rewards a Player for solving them, improving LLM math reasoning without any external data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CPMobius as a cooperative loop in which one model acts as Coach to create instructions matched to the other's current skill level, while the Player solves those tasks and both receive rewards tied to measured performance gains. This setup is intended to drive iterative improvement in the Player's mathematical reasoning purely through self-generated challenges rather than curated datasets or labels. A sympathetic reader would care because current scaling of reasoning models depends on ever-larger human-labeled corpora whose supply is already showing limits. If the loop works as described, training could continue to advance reasoning ability even after external data sources are exhausted.

Core claim

CPMobius treats the Coach and Player as cooperative roles inside a closed reinforcement-learning loop. The Coach proposes instructions whose difficulty is calibrated to the Player's observed capability and receives reward proportional to subsequent gains in the Player's accuracy; the Player is rewarded for correctly solving the tasks the Coach supplies. Through repeated cycles of proposal, solution, and reward assignment, the Player's mathematical reasoning improves without any external training data or human labels.
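
The order of operations in one cycle (propose, solve, reward the Player, re-measure, reward the Coach) can be sketched in a few lines. This is a toy illustration and not the paper's implementation: `evaluate`, `run_round`, the scalar `skill`, the 0.05 difficulty spacing, the 0.12 solve margin, and the `step` update are all hypothetical stand-ins for the models, the benchmark, and the RL update. Only the reward structure (Player rewarded for solving, Coach rewarded on the change in Player performance) comes from the description above.

```python
from typing import List, Tuple


def evaluate(skill: float) -> float:
    """Toy stand-in for benchmark accuracy; here accuracy simply equals skill."""
    return skill


def run_round(skill: float, n_tasks: int = 4, step: float = 0.1) -> Tuple[float, List[float], float]:
    """One toy Coach-Player round. Returns (new_skill, player_rewards, coach_reward)."""
    # Coach: propose task difficulties at and just above the Player's current skill.
    tasks = [min(1.0, skill + 0.05 * k) for k in range(n_tasks)]
    # Player: reward 1.0 per solved task; "solved" is a hypothetical threshold rule.
    player_rewards = [1.0 if skill + 0.12 >= d else 0.0 for d in tasks]
    acc_before = evaluate(skill)
    # Stand-in for the Player's RL update: skill grows with the mean reward.
    new_skill = min(1.0, skill + step * sum(player_rewards) / n_tasks)
    # Coach: rewarded on the change in measured Player performance.
    coach_reward = evaluate(new_skill) - acc_before
    return new_skill, player_rewards, coach_reward


if __name__ == "__main__":
    skill = 0.5
    for t in range(3):
        skill, p_rewards, c_reward = run_round(skill)
        print(t, round(skill, 3), p_rewards, round(c_reward, 3))
```

In this toy the Coach's reward is positive only when the tasks it proposed led to a measurable change in the Player, which is the coupling the core claim depends on.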

What carries the argument

The Coach-Player cooperative optimization loop, in which the Coach is rewarded for producing tasks that measurably raise the Player's performance.

If this is right

The Player model records an average accuracy gain of 4.9 points overall and 5.4 points out of distribution on standard math benchmarks.
The same model exceeds the RENT baseline by 1.5 points overall and the R-zero baseline by 4.2 points out of distribution.
All observed gains occur without access to any external training data or human-provided labels.
The method applies the same data-free loop across multiple base models and benchmark suites.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cooperative loop might be adapted to non-mathematical domains if the Coach can be given a suitable way to score task difficulty and correctness.
If the loop continues to produce gains after the initial iterations, repeated application could serve as a general post-training refinement stage for any reasoning model.
Replacing the single Coach with several specialized Coaches could test whether diversity in task generation further accelerates improvement.

Load-bearing premise

The Coach can reliably generate tasks whose increasing difficulty directly strengthens the Player's reasoning ability through performance feedback alone.

What would settle it

Run the full CPMobius loop for the reported number of iterations on Qwen2.5-Math-7B-Instruct. If accuracy on the same held-out math benchmarks stays flat or declines relative to the starting checkpoint, the central claim fails.

Original abstract

Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPM\"obius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPM\"obius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPM\"obius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy. Our codebase has been released at https://github.com/thunlp/CPMobius.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CPMobius, a collaborative Coach-Player paradigm for data-free reinforcement learning to improve mathematical reasoning in LLMs. The Coach proposes targeted instructions and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving those tasks. This cooperative loop is claimed to yield gains without external training data or labels, with reported improvements of +4.9 overall accuracy and +5.4 OOD accuracy on Qwen2.5-Math-7B-Instruct, exceeding the RENT baseline by 1.5 points overall and the R-zero baseline by 4.2 points OOD. The codebase is released for reproducibility.

Significance. If the gains are shown to arise purely from the mutual-reward loop without embedded external structure in task generation, the approach would meaningfully advance data-free scaling of reasoning models. The public codebase release is a clear strength for verification. However, the significance hinges on resolving whether the Coach's instruction generation relies on any static templates or priors, as this directly affects the central data-free claim.

major comments (2)

Abstract: The central claim that CPMobius achieves improvement 'without relying on any external training data' is load-bearing but under-supported. The description states that the Coach 'proposes instructions targeted at the Player's capability' yet provides no account of Coach initialization, prompt templates, or generation process. If any fixed examples, difficulty heuristics, or domain scaffolding are present, the data-free premise does not hold and the reported +4.9 / +5.4 gains could be attributable to prompt engineering rather than the cooperative RL dynamic.
Method (implied in abstract description of rewards): The Coach reward is defined directly from observed performance deltas in the same loop. This creates a moderate circularity risk where improvements may be artifacts of the training dynamics rather than independent reasoning gains. An explicit analysis or ablation separating the reward signal from the optimization loop is needed to substantiate that the Player's mathematical reasoning ability is genuinely enhanced.

minor comments (2)

Abstract: The notation 'CPM\"obius' appears to be a LaTeX formatting artifact and should be rendered consistently as CPMobius throughout.
The abstract reports concrete numerical gains but does not specify the number of runs, statistical significance tests, or exact evaluation protocol for the OOD split; these details should be added for clarity even if present in the full experimental section.
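
One way to report the uncertainty this comment asks for is a paired bootstrap over benchmark items. The sketch below is a generic statistical recipe, not the paper's evaluation protocol: the function name `paired_bootstrap_gain`, the 0/1 per-item correctness inputs, the resample count, and the 95% percentile interval are all assumptions chosen for illustration.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gain(
    base_correct: List[int],
    tuned_correct: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy gain in points with a 95% percentile bootstrap interval.

    Inputs are per-item 0/1 correctness for the same items, in the same order.
    Returns (point_estimate, lower, upper).
    """
    if len(base_correct) != len(tuned_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(base_correct)
    # Per-item difference keeps the pairing between the two checkpoints.
    diffs = [t - b for b, t in zip(base_correct, tuned_correct)]
    point = 100.0 * sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute the gain.
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(100.0 * sum(diffs[i] for i in idx) / n)
    samples.sort()
    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return point, lower, upper
```

An interval that excludes zero would suggest the gain is unlikely to be resampling noise on that item set; it says nothing about variance across training runs, which would need repeated runs with different seeds.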

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify areas where additional clarity on the data-free claim and reward design would strengthen the paper. We have revised the manuscript to expand the Method section with explicit details on Coach initialization and prompt templates, and to include new ablations addressing the reward circularity concern. Point-by-point responses follow.

Referee: Abstract: The central claim that CPMobius achieves improvement 'without relying on any external training data' is load-bearing but under-supported. The description states that the Coach 'proposes instructions targeted at the Player's capability' yet provides no account of Coach initialization, prompt templates, or generation process. If any fixed examples, difficulty heuristics, or domain scaffolding are present, the data-free premise does not hold and the reported +4.9 / +5.4 gains could be attributable to prompt engineering rather than the cooperative RL dynamic.

Authors: We agree that the original description was insufficiently detailed on this point. In the revised manuscript we have added a dedicated subsection in the Method section that fully specifies Coach initialization (the base LLM with no additional parameters, datasets, or fine-tuning) and the precise prompt templates used for instruction generation. These templates consist of general directives that instruct the model to generate new math problems conditioned only on aggregate performance statistics from the Player (e.g., recent accuracy and error patterns); they contain no fixed examples, hand-crafted difficulty heuristics, or external domain scaffolding. The full prompts are now reproduced verbatim in the appendix. Because the Coach begins from the identical base model as the Player and receives no external data at any stage, the data-free premise is preserved; the observed gains arise from the iterative cooperative loop rather than prompt engineering.
revision: yes
Referee: Method (implied in abstract description of rewards): The Coach reward is defined directly from observed performance deltas in the same loop. This creates a moderate circularity risk where improvements may be artifacts of the training dynamics rather than independent reasoning gains. An explicit analysis or ablation separating the reward signal from the optimization loop is needed to substantiate that the Player's mathematical reasoning ability is genuinely enhanced.

Authors: We acknowledge the validity of this concern. To separate the reward signal from the optimization loop, we have added an ablation in the revised Experiments section in which the Coach is given a fixed, non-adaptive reward (absolute accuracy rather than performance delta). Under this fixed-reward regime the Player exhibits substantially smaller gains, indicating that the dynamic delta-based signal is necessary for the Coach to generate appropriately targeted tasks. We further evaluate the final Player checkpoint on a completely disjoint held-out test set that never participates in the training loop; the retained accuracy improvements on this set confirm that the gains reflect genuine reasoning enhancement rather than loop-specific artifacts. These results appear in the new Table 4 and accompanying discussion.
revision: yes
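
The difference between the two Coach reward signals in this ablation can be made concrete with a small comparison. The function names and the accuracy values below are hypothetical and chosen only to show the contrast; the rebuttal states only that the ablation swaps the performance delta for absolute accuracy.

```python
def coach_reward_delta(acc_before: float, acc_after: float) -> float:
    """Delta-based signal: the Coach is paid for the change in Player accuracy."""
    return acc_after - acc_before


def coach_reward_absolute(acc_after: float) -> float:
    """Fixed signal from the ablation: the Coach is paid for Player accuracy itself."""
    return acc_after


# Hypothetical (accuracy before, accuracy after) for two batches of Coach tasks.
easy_batch = (0.80, 0.80)    # Player was already strong; nothing was learned
useful_batch = (0.40, 0.55)  # lower accuracy, but the Player improved

if __name__ == "__main__":
    for name, (before, after) in [("easy", easy_batch), ("useful", useful_batch)]:
        print(name, coach_reward_absolute(after), round(coach_reward_delta(before, after), 2))
```

Under the absolute signal the easy batch scores higher, so a Coach could be rewarded for proposing tasks the Player already solves. Under the delta signal only the batch that moved the Player earns reward, which is consistent with the rebuttal's account of why the fixed-reward regime yields smaller gains.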

Circularity Check

0 steps flagged

No circularity: derivation relies on external accuracy benchmarks

The paper describes a Coach-Player RL loop with rewards based on observed performance deltas and claims data-free gains measured on held-out math benchmarks (e.g., +4.9 average accuracy). No equation or step reduces the reported improvement to a tautology, fitted parameter, or self-citation chain. Coach initialization and task generation are described at the level of the interaction dynamic without importing uniqueness theorems or ansatzes from prior self-work. The result is externally falsifiable via the stated accuracy numbers and does not collapse to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that performance-based rewards in a closed coach-player loop produce genuine reasoning gains without external supervision. No explicit free parameters or invented physical entities are described.

axioms (1)

Domain assumption: Cooperative optimization between independent Coach and Player roles improves the Player's reasoning capability.
Core premise of the paradigm, stated in the abstract.

invented entities (1)

Coach-Player collaborative roles (no independent evidence)
Purpose: enable a data-free RL loop for reasoning improvement.
New framing of roles introduced to replace adversarial self-play.

pith-pipeline@v0.9.0 · 5604 in / 1099 out tokens · 29706 ms · 2026-05-16T08:39:31.999161+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Coach proposes instructions... receives rewards based on changes in the Player's performance... R^Coach_i = R^Player_i · Δt

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: difficulty-filtered batching... 0.2 ≤ acc_i ≤ 0.8
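
The two quoted passages (a difficulty filter keeping tasks with 0.2 ≤ acc_i ≤ 0.8, and a Coach reward equal to the Player reward on task i times Δt) can be combined in a short sketch. The reading of the symbols is an assumption: acc_i is taken as the fraction of sampled Player attempts that solve task i, the Player reward for task i is taken to equal acc_i, and Δt is taken as the change in Player accuracy at step t. The function names and the rollout data are hypothetical.

```python
from typing import Dict, List


def per_task_accuracy(rollouts: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of sampled attempts that solved each task (tasks with no attempts are skipped)."""
    return {task: sum(r) / len(r) for task, r in rollouts.items() if r}


def difficulty_filter(acc: Dict[str, float], low: float = 0.2, high: float = 0.8) -> List[str]:
    """Keep tasks that are neither trivially easy nor out of reach: low <= acc_i <= high."""
    return [task for task, a in acc.items() if low <= a <= high]


def coach_rewards(acc: Dict[str, float], kept: List[str], delta_t: float) -> Dict[str, float]:
    """Assumed form of the quoted rule: Coach reward = Player reward on the task * delta_t."""
    return {task: acc[task] * delta_t for task in kept}


if __name__ == "__main__":
    rollouts = {
        "a": [True] * 8,                   # always solved: filtered out
        "b": [True, False, True, False],   # acc 0.5: kept
        "c": [False] * 8,                  # never solved: filtered out
        "d": [True, False, False, False],  # acc 0.25: kept
    }
    acc = per_task_accuracy(rollouts)
    kept = difficulty_filter(acc)
    print(kept, coach_rewards(acc, kept, delta_t=0.1))
```

Filtering on the middle band discards tasks where every attempt agrees, since those give the Player no learning signal under a correctness reward.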

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.