One-Way Policy Optimization for Self-Evolving LLMs

Chiyu Ma; Guoyin Wang; Jinda Lu; Jingren Zhou; Kexin Huang; Li Yuan; Shaohang Wei; Shuo Yang; Yuyang Liu

arxiv: 2605.22156 · v1 · pith:CJ4UVAN4new · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 07:57 UTC · model grok-4.3

keywords One-Way Policy Optimization · RLVR · self-evolving LLMs · asymmetric reweighting · policy optimization · reinforcement learning · large language models · ratchet effect

The pith

One-Way Policy Optimization lets LLMs self-evolve by letting the verifier set direction while the reference only scales magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that reference-policy constraints in RLVR penalize every deviation equally, which can reverse the verifier's intended direction once the policy improves past the reference and thereby block further gains. OWPO fixes this by decoupling direction from magnitude: the verifier alone decides whether an output is better or worse, while the reference policy is used only to decide how large the update should be. It applies two asymmetric rules—accelerated alignment when the policy is still inferior and gain locking when the policy is superior—then iteratively replaces the reference with the improved policy. This produces a ratchet that locks in each advance and removes dependence on any external fixed model.

Core claim

One-Way Policy Optimization (OWPO) decouples optimization direction, which is dictated solely by the verifier, from update magnitude, which is controlled by the reference policy. It applies asymmetric reweighting that accelerates alignment on inferior deviations and locks gains on superior deviations; iterative replacement of the reference then creates a ratchet effect that consolidates each improvement and enables continuous self-evolution without external reference models.

What carries the argument

Asymmetric reweighting that accelerates updates for inferior deviations and locks gains for superior deviations, combined with iterative reference-policy replacement to produce a ratchet effect.
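A short worked statement of why a reweighting of this kind cannot reverse the verifier's direction, using the weight definition quoted in the Lean-links section of this page. It assumes that $\epsilon_{\mathrm{low}} > 0$ and that the weight $w_t$ multiplies the advantage-weighted policy-gradient term for token $t$; both are assumptions of this sketch rather than statements confirmed by the abstract.

```latex
\[
w_t \triangleq \operatorname{sg}\!\left[\operatorname{clip}\!\left(e^{-\delta_t},\ \epsilon_{\mathrm{low}},\ \epsilon_{\mathrm{high}}\right)\right]
\in [\epsilon_{\mathrm{low}},\ \epsilon_{\mathrm{high}}],
\qquad
\epsilon_{\mathrm{low}} > 0
\;\Longrightarrow\;
\operatorname{sgn}(w_t A_t) = \operatorname{sgn}(A_t).
\]
```

Under these assumptions the reference policy can shrink or enlarge a token's update but never change its sign, which is the sense in which direction and magnitude are decoupled.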

Load-bearing premise

The verifier's binary signal continues to point in the correct direction even after the policy surpasses the current reference, and successive reference updates do not create instabilities or shifts that invalidate the reweighting rules.

What would settle it

An ablation that removes the gain-locking rule and shows that performance stops improving or becomes unstable once the policy exceeds the initial reference policy.

Figures

Figures reproduced from arXiv: 2605.22156 by Chiyu Ma, Guoyin Wang, Jinda Lu, Jingren Zhou, Kexin Huang, Li Yuan, Shaohang Wei, Shuo Yang, Yuyang Liu.

**Figure 1.** Comparison of RLVR paradigms. (A) Pure RLVR ensures the correct optimization direction via verifier signals but suffers from instability due to sparsity. (B) RLVR with Reference Policy (e.g., KL regularization) improves stability but introduces a direction conflict: the reference constraint can forcibly reverse reward-improving updates (Force Reversal) when the policy attempts to deviate from the prior, t…

**Figure 2.** Overview of OWPO. The pipeline decouples the optimization direction (determined by the Verifier $V_{GT}$) from the update magnitude (modulated by the Ref Policy $\pi_{\mathrm{ref}}$). Based on the Directional Deviation $\delta_t$, OWPO dynamically applies asymmetric weights $w_t$: executing Accelerated Alignment for inferior deviations to correct lag, and Gain Locking for superior deviations to protect exploration gains. Furthermore, th…

**Figure 3.** Linearized effective progress dynamics. Left: Standard KL exhibits Force Reversal ($g < 0$) when deviations are large, effectively negating the reward signal. Right: OWPO maintains Unidirectional Dynamics ($g \geq \epsilon_{\mathrm{low}}$). The asymmetric profile enables Accelerated Alignment for lags ($\delta < 0$) and Variance Reduction for gains ($\delta > 0$), preventing any direction flip.
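The contrast in Figure 3 can be illustrated with a toy calculation. The sketch below is not the paper's linearization: it assumes a unit-magnitude advantage and a quadratic log-ratio penalty with coefficient `beta`, under which the net coefficient on the policy-gradient term becomes `1 - beta * delta`. The OWPO curve uses the clipped weight quoted elsewhere on this page. The function names and the values of `beta`, `eps_low`, and `eps_high` are hypothetical.

```python
import math

def gain_kl(delta: float, beta: float = 0.5) -> float:
    """Toy effective gain under a quadratic log-ratio penalty (assumed form).

    With |A_t| = 1, the net coefficient on grad log pi is A_t * (1 - beta * delta),
    so the gain turns negative once delta > 1 / beta.
    """
    return 1.0 - beta * delta

def gain_owpo(delta: float, eps_low: float = 0.5, eps_high: float = 2.0) -> float:
    """Effective gain as the clipped weight clip(exp(-delta), eps_low, eps_high)."""
    return min(max(math.exp(-delta), eps_low), eps_high)

if __name__ == "__main__":
    # Negative delta: policy lags the reference; positive delta: policy is ahead.
    for delta in (-2.0, -0.5, 0.0, 0.5, 3.0):
        print(f"delta={delta:+.1f}  kl_gain={gain_kl(delta):+.3f}  owpo_gain={gain_owpo(delta):+.3f}")
```

In this toy setting the penalized gain changes sign for large positive deviations, while the clipped weight stays within `[eps_low, eps_high]` and therefore remains positive.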

**Figure 4.** Training Dynamics Comparison. The Pass@1 score curves on AIME24 during training. While distillation-based methods (OPD, MOPD) converge quickly, they tend to plateau near the reference performance. OWPO exhibits continuous improvement, significantly outperforming the baselines at convergence.

**Figure 5.** Iterative Self-Evolution. Comparison of MOPD (left) and OWPO (right) starting from a suboptimal prior (≈ 30%, grey dashed line). We employ a stage-wise bootstrapping protocol: at the end of each iteration, the best checkpoint is frozen and serves as the $\pi_{\mathrm{ref}}$ for the subsequent iteration (e.g., the final model of Iter 1 becomes $\pi_{\mathrm{ref}}$ for Iter 2). OWPO demonstrates superior efficiency, reaching ≈ 40% accuracy…
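The stage-wise bootstrapping protocol in this caption can be sketched as a small loop. This is a structural illustration only: `run_ratchet`, `toy_train_iteration`, and the integer "policies" are hypothetical stand-ins, and the toy scores are not results from the paper.

```python
from typing import Callable, List, Tuple

# (validation score, policy handle); both are placeholders for this sketch.
Checkpoint = Tuple[float, int]

def run_ratchet(initial_ref: int,
                train_iteration: Callable[[int], List[Checkpoint]],
                num_iterations: int) -> List[float]:
    """Train against a frozen reference, then promote the best checkpoint to reference."""
    ref = initial_ref
    best_scores: List[float] = []
    for _ in range(num_iterations):
        checkpoints = train_iteration(ref)            # one training stage against `ref`
        best_score, best_policy = max(checkpoints, key=lambda c: c[0])
        best_scores.append(best_score)
        ref = best_policy                             # frozen best checkpoint becomes the next reference
    return best_scores

def toy_train_iteration(ref: int) -> List[Checkpoint]:
    """Stand-in for a training run: policies are integers and the score equals the integer."""
    return [(float(ref + 1), ref + 1), (float(ref + 3), ref + 3), (float(ref + 2), ref + 2)]

scores = run_ratchet(0, toy_train_iteration, 3)
```

Selecting the best checkpoint rather than the last one mirrors the caption's wording; whether selection uses a held-out set is not stated in the excerpt.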

**Figure 6.** Ablation study under the suboptimal reference setting. We report the Pass@1 accuracy on AIME24 and AIME25 benchmarks. The grey dashed line indicates the baseline performance of the reference policy. We compare the full OWPO against variants removing the asymmetric design (w/o Asym), Gain Locking (w/o Locking), or Accelerated Alignment (w/o Accel).

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a "Ratchet Effect" that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes One-Way Policy Optimization (OWPO) for Reinforcement Learning with Verifiable Rewards (RLVR) in LLMs. It identifies that existing token-level constraints relative to a reference policy penalize deviations indiscriminately and can flip the verifier-determined direction when the policy outperforms the reference. OWPO decouples optimization direction (set by the verifier) from update magnitude (scaled by the reference) via asymmetric reweighting: Accelerated Alignment on inferior deviations and Gain Locking on superior deviations. Iterative reference updates are introduced to produce a 'Ratchet Effect' that consolidates gains and enables continuous self-evolution without external reference models. Experiments are reported to show OWPO outperforming baselines including DAPO, OPD, and MOPD.

Significance. If the central claims hold, OWPO could meaningfully advance stable self-improvement in reasoning LLMs by removing reliance on fixed external priors while maintaining directional control from verifiers. This would address a practical bottleneck in sparse-reward RLVR settings and support more efficient iterative scaling of model capabilities.

major comments (2)

[Abstract] Abstract and method description: the asymmetric reweighting rules treat the reference solely as a magnitude scaler, yet the manuscript provides no analysis or bounds on how these rules (Accelerated Alignment and Gain Locking) behave when the reference is replaced by the previous policy iterate. This is load-bearing for the ratchet-effect and self-evolution claims, because any distribution shift can change which tokens are labeled inferior versus superior and potentially reverse the intended update direction.
[Experiments] Experimental section: no ablation on verifier reliability, no measurement of KL drift under iterative reference updates, and no error analysis are reported despite the central claim that OWPO breaks the fixed-prior bottleneck. Without these, the reported outperformance over DAPO, OPD, and MOPD cannot be assessed for robustness to the distribution-shift concern.

minor comments (2)

[Abstract] The abstract would benefit from a concise statement of the reweighting coefficients or the precise condition used to classify a deviation as inferior or superior.
Notation for the reference policy update schedule and the exact form of the asymmetric loss terms should be introduced early and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the requirements for supporting the ratchet-effect and self-evolution claims in OWPO. We respond to each major comment below and outline the planned revisions.

Referee: [Abstract] Abstract and method description: the asymmetric reweighting rules treat the reference solely as a magnitude scaler, yet the manuscript provides no analysis or bounds on how these rules (Accelerated Alignment and Gain Locking) behave when the reference is replaced by the previous policy iterate. This is load-bearing for the ratchet-effect and self-evolution claims, because any distribution shift can change which tokens are labeled inferior versus superior and potentially reverse the intended update direction.

Authors: We acknowledge that the manuscript does not currently include formal analysis or bounds on the behavior of the asymmetric reweighting rules under iterative reference updates. The core design principle is that the verifier reward determines update direction independently of the reference, which serves only as a magnitude scaler; this decoupling is intended to prevent the direction flips observed in prior token-level constraint methods. Nevertheless, we agree that explicit discussion of distribution shift effects on inferior/superior token labeling is needed to substantiate the ratchet effect. In the revised manuscript we will add a dedicated analysis subsection (with a proof sketch under the assumption of a consistent verifier) showing that the intended direction is preserved across iterates. revision: yes
Referee: [Experiments] Experimental section: no ablation on verifier reliability, no measurement of KL drift under iterative reference updates, and no error analysis are reported despite the central claim that OWPO breaks the fixed-prior bottleneck. Without these, the reported outperformance over DAPO, OPD, and MOPD cannot be assessed for robustness to the distribution-shift concern.

Authors: We agree that the current experimental section lacks the requested ablations and measurements, which limits assessment of robustness to distribution shift and verifier noise. To address this, the revised version will incorporate: (i) ablations with controlled verifier noise levels, (ii) plots and tables tracking KL divergence between successive policy iterates under the iterative reference scheme, and (iii) an error analysis highlighting cases where gains fail to consolidate. These additions will directly support the claim that OWPO mitigates the fixed-prior bottleneck. revision: yes
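The KL-drift measurement promised in item (ii) could be computed along the following lines. This is a minimal sketch, assuming each policy iterate exposes a full next-token probability distribution at every evaluated position; `kl_divergence` and `mean_token_kl` are hypothetical helper names, and the rebuttal does not specify the direction of the divergence or how it is averaged.

```python
import math
from typing import List, Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_token_kl(new_dists: List[Sequence[float]],
                  old_dists: List[Sequence[float]]) -> float:
    """Average per-position KL between a new iterate and the previous one."""
    kls = [kl_divergence(p, q) for p, q in zip(new_dists, old_dists)]
    return sum(kls) / len(kls)

# Two token positions over a two-symbol vocabulary (made-up numbers).
drift = mean_token_kl([[0.9, 0.1], [0.5, 0.5]],
                      [[0.5, 0.5], [0.5, 0.5]])
```

Tracking this quantity at each reference swap would show whether successive iterates stay close or drift, which is the concern the referee raises.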

Circularity Check

0 steps flagged

No circularity detected; OWPO presented as algorithmic innovation with independent design choices

The paper proposes OWPO as a novel RLVR method using asymmetric reweighting (Accelerated Alignment on inferior tokens, Gain Locking on superior tokens) and iterative reference updates to create a Ratchet Effect. No equations or derivations are shown that reduce by construction to fitted parameters or self-citations. The central claims rest on the algorithmic principle of decoupling direction from magnitude, which is externally motivated by observed issues with fixed priors rather than self-referential fitting or renaming. This qualifies as self-contained algorithmic work without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the high-level description of asymmetric reweighting and the named 'Ratchet Effect'.

pith-pipeline@v0.9.0 · 5773 in / 1005 out tokens · 19720 ms · 2026-05-22T07:57:21.721903+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

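IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $\delta_t(\theta) \triangleq \operatorname{sgn}(A_t)\cdot\log\frac{\pi_\theta(y_t\mid s_t)}{\pi_{\mathrm{ref}}(y_t\mid s_t)}$; $w_t \triangleq \operatorname{sg}\left[\operatorname{clip}\left(\exp(-\delta_t),\ \epsilon_{\mathrm{low}},\ \epsilon_{\mathrm{high}}\right)\right]$; Accelerated Alignment for $\delta_t<0$ and Gain Locking for $\delta_t>0$
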
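The two definitions in the passage above translate into a few lines of code. The sketch below works on a single token with scalar log-probabilities; the function names and the default clip bounds `eps_low=0.5` and `eps_high=2.0` are hypothetical, since the excerpt does not give the paper's values. The stop-gradient `sg[...]` is represented only by treating the weight as a plain number.

```python
import math

def directional_deviation(advantage: float, logp_policy: float, logp_ref: float) -> float:
    """delta_t = sgn(A_t) * (log pi_theta(y_t|s_t) - log pi_ref(y_t|s_t))."""
    sign = (advantage > 0) - (advantage < 0)
    return sign * (logp_policy - logp_ref)

def owpo_weight(delta: float, eps_low: float = 0.5, eps_high: float = 2.0) -> float:
    """w_t = clip(exp(-delta_t), eps_low, eps_high), used as a constant multiplier."""
    return min(max(math.exp(-delta), eps_low), eps_high)

# Correct token (A > 0) that the policy rates below the reference: inferior deviation.
d_lag = directional_deviation(1.0, math.log(0.2), math.log(0.4))
w_lag = owpo_weight(d_lag)    # greater than 1: Accelerated Alignment

# Correct token that the policy already rates above the reference: superior deviation.
d_gain = directional_deviation(1.0, math.log(0.8), math.log(0.4))
w_gain = owpo_weight(d_gain)  # less than 1: Gain Locking
```

For a token with negative advantage the sign factor flips the comparison, so a policy that already suppresses a wrong token more strongly than the reference also counts as a superior deviation.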
IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Ratchet Effect via periodic hard swap $\pi_{\mathrm{ref}} \leftarrow \pi_\theta$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 14 internal anchors

[1] Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585.

[2] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[3] Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998.

[4] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[5] Hochlehnert, A., Bhatnagar, H., Udandarao, V., Albanie, S., Prabhu, A., and Bethge, M. A sober look at progress in language model reasoning: Pitfalls and paths to reproducibility. arXiv preprint arXiv:2504.07086.

[6] Huang, K., Meng, H., Wu, J., Lu, J., Ma, C., Chen, Z., Wang, X., Ding, B., Wu, J., Wang, X., et al. On the direction of RLVR updates for LLM reasoning: Identification and exploitation. arXiv preprint arXiv:2603.22117, 2026.

[7] Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.

[8] Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

[9] On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation.
Lyu, Y., Wang, C., Huang, J., and Xu, T. From correction to mastery: Reinforced distillation of large language model agents. arXiv preprint arXiv:2509.14257.

[10] Ma, C., Yang, S., Huang, K., Lu, J., Meng, H., Wang, S., Ding, B., Vosoughi, S., Wang, G., and Zhou, J. FIPO: Eliciting deep reasoning with future-KL influenced policy optimization. arXiv preprint arXiv:2603.19835.

[11] Meng, H., Huang, K., Wei, S., Ma, C., Yang, S., Wang, X., Wang, G., Ding, B., and Zhou, J. Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLMs. arXiv preprint arXiv:2603.22446, 2026.

[12] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[13] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[14] Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.

[15] Xiao, B., Xia, B., Yang, B., Gao, B., Shen, B., Zhang, C., He, C., Lou, C., Luo, F., Wang, G., et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780.

[16] Xu, H., Zhu, Q., Deng, H., Li, J., Hou, L., Wang, Y., Shang, L., Xu, R., and Mi, F. KDRL: Post-training reasoning LLMs via unified knowledge distillation and reinforcement learning. arXiv preprint arXiv:2506.02208.

[17] Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.

[18] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[19] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[20] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.

[21] framework. For the training dataset, we utilize dapo-math-17k across all main experiments. This choice ensures a fair and controlled comparison with the baseline methods. B.2. Model Architectures and Constraints. We evaluate our method across two representative model scales and architectures: Qwen2.5-Math-7B and Qwen3-8B-Base. To accommodate the varying reason...
