Experience Augmented Policy Optimization for LLM Reasoning

Chiyu Ma; Guoyin Wang; Jinda Lu; Jinghan Li; Jingren Zhou; Junkang Wu; Kexin Huang; Shaohang Wei; Shuo Yang; Xiang Wang

arxiv: 2606.30420 · v1 · pith:25R6PVE3new · submitted 2026-06-29 · 💻 cs.LG

Pith reviewed 2026-06-30 06:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords Reinforcement Learning with Verifiable Rewards · LLM Reasoning · Policy Optimization · Experience Reuse · Importance Sampling · RLVR

The pith

EAPO improves LLM reasoning by selectively injecting experience from a prior policy at critical rollout points with adapted importance sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing RLVR methods for LLM reasoning either optimize on-policy from scratch, which wastes sampling effort, or reuse fixed trajectories that become mismatched as the policy changes. The paper proposes that experience should instead be reused adaptively, as an action-level prior drawn from an earlier RL-optimized policy. EAPO implements this by injecting the prior experience only at selected decision points during new rollouts and applying an adapted importance sampling correction intended to keep the resulting updates stable and unbiased. Experiments with Qwen-2.5-math 7B and Qwen-3-8B across five benchmarks are reported to yield consistent gains over state-of-the-art RLVR baselines.

Core claim

By using a prior RL-optimized policy as an action-level experience prior and selectively injecting that experience at critical decision points during rollout together with an adapted importance sampling scheme, EAPO achieves more efficient use of accumulated experience in RLVR and produces stable policy updates that improve LLM reasoning performance.

What carries the argument

The experience-augmented rollout that selectively inserts action-level experience from a prior policy at critical decision points, combined with adapted importance sampling to preserve unbiased updates.
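The rollout described above can be illustrated with a minimal sketch. This is not the authors' implementation: the page does not say how a "critical decision point" is detected, so the sketch assumes, purely for illustration, that a position is critical when the entropy of the current policy's next-token distribution exceeds a threshold `tau`. The names `augmented_rollout`, `toy_current`, and `toy_prior` are hypothetical, and the two toy policies are hand-written stand-ins for language models.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def augmented_rollout(current_policy, prior_policy, steps, tau, rng):
    """Generate one token sequence, sampling from the prior at flagged positions.

    current_policy / prior_policy: hypothetical callables that map a token
    prefix (list of ints) to a list of next-token probabilities.
    tau: ASSUMED entropy threshold that marks a critical decision point.
    Returns (tokens, resampled_mask, behavior_probs).
    """
    tokens, mask, behavior = [], [], []
    for _ in range(steps):
        p_cur = current_policy(tokens)
        critical = entropy(p_cur) > tau          # assumed detection rule
        probs = prior_policy(tokens) if critical else p_cur
        tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(tok)
        mask.append(critical)
        behavior.append(probs[tok])              # probability actually used to sample
    return tokens, mask, behavior

def toy_current(prefix):
    # Toy current policy: uncertain at even positions, confident at odd ones.
    if len(prefix) % 2 == 0:
        return [0.25, 0.25, 0.25, 0.25]
    return [0.97, 0.01, 0.01, 0.01]

def toy_prior(prefix):
    # Toy prior policy: always prefers token 0.
    return [0.7, 0.1, 0.1, 0.1]

rng = random.Random(0)
tokens, mask, behavior = augmented_rollout(toy_current, toy_prior, steps=6, tau=1.0, rng=rng)
# Share of tokens that were drawn from the prior policy.
resampled_ratio = sum(mask) / len(mask)
```

Recording `behavior` alongside the tokens matters because any later importance-sampling correction needs the probability under which each token was actually sampled, not the probability the current policy would have assigned.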

If this is right

Sampling costs fall because experience is reused instead of regenerating everything from scratch on every update.
Policy mismatch is avoided because experience is expressed in a policy-adaptive form rather than as fixed trajectories.
Learning remains stable because the adapted importance sampling keeps updates unbiased.
Reasoning accuracy rises consistently on math and related benchmarks for the tested model sizes.
The same mechanism can be applied to other base models beyond the two Qwen variants evaluated.
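The stability point in the list above rests on a standard property of importance sampling, which a small exact calculation can make concrete. The sketch below is a generic illustration, not the paper's adapted scheme: it assumes, hypothetically, that actions are drawn from a fixed mixture of the current and prior policies with mixing weight `inject_prob`, and all probabilities and the reward function are made-up toy values.

```python
def expectation(dist, f):
    """Exact expectation of f under a categorical distribution."""
    return sum(p * f(a) for a, p in enumerate(dist))

pi_current = [0.5, 0.3, 0.2]   # toy current policy over three actions
pi_prior = [0.1, 0.2, 0.7]     # toy prior policy
inject_prob = 0.4              # ASSUMED probability of sampling from the prior

# Behavior distribution that actually generates the data.
behavior = [(1 - inject_prob) * c + inject_prob * q
            for c, q in zip(pi_current, pi_prior)]

def reward(action):
    # Toy reward: only action 2 is "correct".
    return float(action == 2)

on_policy = expectation(pi_current, reward)   # what we want to estimate
naive = expectation(behavior, reward)         # biased: ignores the injection

# Importance weights pi_current / behavior undo the mismatch.
weights = [c / b for c, b in zip(pi_current, behavior)]
corrected = sum(b * w * reward(a)
                for a, (b, w) in enumerate(zip(behavior, weights)))
```

In this toy setting the naive average overstates the value of the current policy because the prior favors the rewarded action, while the reweighted average matches the on-policy value. Whether EAPO's adapted weights achieve the same property under its actual injection rule is the question the referee report raises.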

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The selective-injection idea could be tested in non-LLM sequential tasks where policies also evolve during training.
If the critical-point detection proves robust, it might reduce the frequency of full policy retraining when capabilities shift.
Applying EAPO to much larger models would test whether the efficiency gains scale with parameter count.
Combining the adaptive prior with other variance-reduction techniques might produce further sample-efficiency improvements.

Load-bearing premise

That selectively injecting prior-policy experience at chosen points during rollout plus the adapted importance sampling correction together produce stable updates without introducing new mismatch or bias.

What would settle it

Reproducing the five-benchmark experiments with the same Qwen models and finding that EAPO yields no improvement or a drop relative to the baseline RLVR methods would falsify the performance claim.
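Such a reproduction comes down to scoring each method on the same problems and comparing the scores. The figure captions on this page report AIME'24 results as Mean@32; the sketch below assumes the common reading of Mean@k, namely the fraction of correct answers among k sampled responses per problem, averaged over problems. The function name `mean_at_k` and the correctness tables are hypothetical toy data, not results from the paper.

```python
def mean_at_k(correct):
    """Average per-problem accuracy over k sampled responses.

    correct: list with one entry per problem; each entry is a list of
    booleans, one per sampled response (k responses per problem).
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)

# Made-up toy outcomes for two problems with k = 4 samples each.
baseline = [[True, False, False, False], [True, True, False, False]]
candidate = [[True, True, False, False], [True, True, True, False]]

gain = mean_at_k(candidate) - mean_at_k(baseline)
```

A non-positive `gain` across the five benchmarks, under matched sampling budgets and seeds, is the kind of outcome that would count against the performance claim.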

Figures

Figures reproduced from arXiv: 2606.30420 by Chiyu Ma, Guoyin Wang, Jinda Lu, Jinghan Li, Jingren Zhou, Junkang Wu, Kexin Huang, Shaohang Wei, Shuo Yang, Xiang Wang.

**Figure 1.** Overview of experience-augmented reinforcement learning. (a) Standard reinforcement learning with verifiable rewards (RLVR) optimizes a base policy π_Base into an RL-optimized policy π_RL through iterative rollouts and policy updates. (b) Instead of training from scratch, EAPO leverages experience from a prior RL-optimized policy π_RL during rollout of the current policy by resampling actions at critical deci…

**Figure 2.** Sensitivity analysis of the resampling threshold τ. (a) AIME'24 performance (Mean@32) under different values of τ, reporting the best performance achieved during training. (b) Evolution of the resampled ratio across training steps, representing the proportion of resampled tokens within the experience-augmented responses. (c) Accuracy of experience-augmented responses across training steps, reflecting the…

**Figure 3.** Influence of experience granularity and filtering strategy. (a) AIME'24 performance (Mean@32) comparison between token-level (Tokens, Pos-Only) and trajectory-level (Trajectory, Pos-Only) experience. Following (Zhan et al., 2025; Liu et al., 2025), we consider only positive samples in this comparison, where positive samples refer to trajectories whose final predictions are correct. (b) KL divergence to the…

**Figure 4.** Additional ablation results. Left: impact of experience annealing steps. Right: complementary effects of EAPO components. Effect of experience annealing. Figure 4a studies the impact of the experience annealing steps T. Across all settings, incorporating experience consistently improves performance over the baseline, confirming the general benefit of experience-guided resampling. When the annealing period …
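The Figure 4 text mentions experience annealing over T steps, but the excerpt is truncated before it says what is annealed or how. As a rough sketch only, the function below assumes a linear schedule in which a scale factor on experience injection decays from 1 at step 0 to 0 at step T and stays at 0 afterwards. The name `injection_scale` and the linear shape are assumptions, not details taken from the paper.

```python
def injection_scale(step, anneal_steps):
    """ASSUMED linear annealing of experience injection.

    Returns 1.0 at step 0, decays linearly to 0.0 at `anneal_steps`,
    and stays at 0.0 afterwards. With anneal_steps <= 0 no experience
    is injected at all.
    """
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)

# Example: scale at a few training steps for a hypothetical T = 100.
schedule = [injection_scale(s, 100) for s in (0, 50, 100, 150)]
```

One way to use such a factor would be to multiply the probability that a flagged position is resampled from the prior policy, so that training becomes fully on-policy once the annealing period ends.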

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for improving the reasoning capabilities of large language models (LLMs). However, existing RLVR methods typically rely on on-policy optimization from scratch, resulting in high sampling costs and inefficient utilization of accumulated experience. As model capabilities and policy behaviors evolve during training, recent attempts to reuse experience via fixed reasoning trajectories further suffer from policy mismatch. Motivated by these limitations, we argue that experience in RLVR should not be reused as fixed reasoning trajectories, but instead expressed in a policy-adaptive manner. In this work, we propose Experience-Augmented Policy Optimization (EAPO), which leverages a prior RL-optimized policy as an action-level experience prior and selectively injects experience at critical decision points during rollout. To ensure stable and unbiased learning from experience-augmented rollouts, EAPO further incorporates an adapted importance sampling scheme. Experiments using Qwen-2.5-math 7B and Qwen-3-8B on five different benchmarks demonstrate that EAPO consistently improves reasoning performance over state-of-the-art RLVR methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EAPO tries adaptive experience injection from prior policies in RLVR with adapted importance sampling, but the bias fix is underspecified and the gains are hard to trust without more detail.

The main point of this paper is that EAPO reuses experience from an earlier policy by injecting actions at critical decision points during rollouts, then applies an adapted importance sampling correction to keep updates stable. This is positioned as a fix for both the high sampling costs of on-policy RLVR and the mismatch that comes from replaying fixed trajectories as policies change.

What is new is the shift to policy-adaptive, action-level injection instead of whole-trajectory reuse. The paper does a decent job naming the real inefficiency in current RLVR setups for LLM reasoning and framing the selective injection as a practical response.

The soft spots are more substantial. The abstract gives no equations or derivation for the adapted importance sampling, so it is impossible to check whether the weights properly account for both the prior-policy probability and the selective choice of injection points. The concern is concrete: if the correction is incomplete, the reported gains on Qwen-2.5-math 7B and Qwen-3-8B across five benchmarks could reflect bias rather than genuine improvement. No protocol details, ablations, or bias checks are visible in the abstract, which makes the "consistent improvements over SOTA RLVR" claim difficult to evaluate.

This is for people working on efficient RL post-training for reasoning models who care about lowering sampling cost. A reader in that niche could extract the core idea, but the current write-up leaves the central technical claim unverified.

It deserves peer review so the importance-sampling adaptation can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes Experience-Augmented Policy Optimization (EAPO) for RLVR-based LLM reasoning. It critiques on-policy-from-scratch methods for high sampling cost and fixed-trajectory reuse for policy mismatch, instead using a prior RL-optimized policy to selectively inject actions at critical decision points during rollout, combined with an adapted importance-sampling correction claimed to yield stable, unbiased updates. Experiments with Qwen-2.5-math 7B and Qwen-3-8B on five benchmarks report consistent gains over SOTA RLVR baselines.

Significance. If the adapted importance sampling provably removes bias under selective injection, EAPO would address a genuine inefficiency in experience reuse for RLVR and could reduce sampling costs while improving reasoning performance. The multi-model, multi-benchmark empirical results would then constitute a practically relevant advance, provided the gains are not artifacts of uncorrected bias.

major comments (2)

[Abstract, §3] Abstract and §3 (method): the central claim that the adapted importance sampling 'ensures stable and unbiased learning from experience-augmented rollouts' is load-bearing for all reported gains, yet no derivation, bias analysis, or verification is supplied showing that the correction re-weights both the selective-injection probability and the policy shift at the chosen decision points. Without this, the effective objective may remain biased and the performance improvements cannot be attributed to experience augmentation.
[§4] §4 (experiments): the reported consistent improvements over SOTA RLVR methods on five benchmarks rest on the assumption that the EAPO updates are unbiased; if the importance-sampling adaptation fails under selective injection, the cross-method comparison is confounded and the empirical claim is undermined.

minor comments (2)

[§3] Notation for the 'critical decision points' and the exact form of the adapted importance weights should be defined explicitly with equations rather than prose descriptions.
[Abstract, §1] The abstract and introduction should cite the specific prior RLVR works whose fixed-trajectory reuse is being critiqued.
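To make the request for explicit notation concrete, the block below writes out one possible form such definitions could take. It is an illustrative sketch, not the paper's formulation: the indicator c_t, the entropy-based rule behind it, the behavior distribution μ, and the ratio ρ_t are all assumed symbols. Only τ (the resampling threshold) and π_RL (the prior RL-optimized policy) are named on this page.

```latex
% Illustrative notation only; c_t, \mu, and \rho_t are assumptions.
c_t = \mathbb{1}\!\left[ H\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\right) > \tau \right]
\qquad \text{(assumed rule for a critical decision point)}

\mu(a \mid s_t) = (1 - c_t)\,\pi_{\theta_{\text{old}}}(a \mid s_t)
                + c_t\,\pi_{\text{RL}}(a \mid s_t)

\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\mu(a_t \mid s_t)}

\mathbb{E}_{a \sim \mu(\cdot \mid s_t)}\!\left[ \rho_t(\theta)\, f(s_t, a) \right]
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\!\left[ f(s_t, a) \right]
\quad \text{if } \mu(a \mid s_t) > 0 \text{ whenever } \pi_\theta(a \mid s_t) > 0
```

Under this reading, c_t is a deterministic function of the prefix s_t, so the selection rule is simply part of the behavior policy μ and the per-token identity holds given the support condition. If injection points were instead chosen at random, or using information outside the prefix, μ would need to include that selection probability, which is the gap the first major comment points at.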

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which correctly identify that the unbiasedness of the adapted importance sampling is central to our claims. We address each major comment below and commit to revisions that supply the requested analysis.

Referee: [Abstract, §3] Abstract and §3 (method): the central claim that the adapted importance sampling 'ensures stable and unbiased learning from experience-augmented rollouts' is load-bearing for all reported gains, yet no derivation, bias analysis, or verification is supplied showing that the correction re-weights both the selective-injection probability and the policy shift at the chosen decision points. Without this, the effective objective may remain biased and the performance improvements cannot be attributed to experience augmentation.

Authors: We agree that the manuscript lacks an explicit derivation and bias analysis for the adapted importance sampling under selective injection. The current text describes the scheme but does not derive how the weights correct for both injection probability and policy shift. In the revision we will add a formal derivation and bias analysis to §3, together with a verification on a controlled toy problem, to establish that the objective remains unbiased. revision: yes
Referee: [§4] §4 (experiments): the reported consistent improvements over SOTA RLVR methods on five benchmarks rest on the assumption that the EAPO updates are unbiased; if the importance-sampling adaptation fails under selective injection, the cross-method comparison is confounded and the empirical claim is undermined.

Authors: The empirical comparisons do rest on the unbiasedness of the updates. By incorporating the derivation and verification described above, we will substantiate this assumption in the revised manuscript. We will also update the discussion in §4 to explicitly link the reported gains to the corrected objective and to note the supporting analysis. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains rest on external benchmarks, not self-defined quantities

The provided abstract and description contain no equations, derivations, or self-citations that reduce the claimed performance improvements to quantities defined only by the authors' own fitted parameters or prior work. EAPO is presented as a methodological extension of existing RLVR with selective injection and adapted importance sampling, with gains asserted via experiments on five benchmarks using Qwen models. No load-bearing step equates a 'prediction' to its input by construction, and the central claim remains externally falsifiable against standard RLVR baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on unstated assumptions about the stability of the adapted importance sampling scheme.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 11 internal anchors

[1] Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.

[2] Cheng, D., Huang, S., Zhu, X., Dai, B., Zhao, W. X., Zhang, Z., and Wei, F. Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758.

[3] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[4] URL https://openreview.net/forum?id=r6Pw3RiMYL. Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

[5] URL https://arxiv.org/abs/2510.04140. Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR.

[6] URL https://openreview.net/forum?id=jyOgpu5wfC. Li, S., Zhou, Z., Lam, W., Yang, C., and Lu, C. Repo: Replay-enhanced policy optimization. arXiv preprint arXiv:2506.09340, 2025b. Liang, J., Tang, H., Ma, Y., Liu, J., Zheng, Y., Hu, S., Bai, L., and Hao, J. Squeeze the soaked sponge: Efficient off-policy reinforcement finetuning for large language model....

[7] Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

[8] Lu, J., Wang, S., Hao, Y., Liu, H., Wang, X., and Wang, M. Rethinking visual content refinement in low-shot clip adaptation. arXiv preprint arXiv:2407.14117.

[9] Lu, J., Li, J., Gao, Y., Wu, J., Wu, J., Wang, X., and He, X. Adavip: Aligning multi-modal llms via adaptive vision-enhanced preference optimization, 2025a. URL https://arxiv.org/abs/2504.15619. Lu, J., Wu, J., Li, J., Jia, X., Wang, S., Zhang, Y., Fang, J., Wang, X., and He, X. Dama: Data- and model-aware alignment of multi-modal llms. In International...

[10] doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.

[11] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[12] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[13] Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.

[14] Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939.

[15] Wu, J., Huang, K., Wu, J., Zhang, A., Wang, X., and He, X. Quantile advantage estimation for entropy-safe reasoning. arXiv preprint arXiv:2509.22611.

[16] URL https://openreview.net/forum?id=vO8LLoNWWk. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024a. Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., et al. Qwen2.5-math technical report: Toward...

[17] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[18] Zhan, R., Li, Y., Wang, Z., Qu, X., Liu, D., Shao, J., Wong, D. F., and Cheng, Y. Exgrpo: Learning to reason from experience. arXiv preprint arXiv:2510.02245.

[19] Zhang, H., Fu, J., Zhang, J., Fu, K., Wang, Q., Zhang, F., and Zhou, G. Rlep: Reinforcement learning with experience replay for llm reasoning. arXiv preprint arXiv:2507.07451, 2025a. Zhang, K., Chen, X., Liu, B., Xue, T., Liao, Z., Liu, Z., Wang, X., Ning, Y., Chen, Z., Fu, X., et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 20...
