pith. machine review for the scientific record.

arxiv: 2512.05591 · v2 · submitted 2025-12-05 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 00:49 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords entropy ratio clipping · reinforcement learning · policy stability · PPO clipping · language model post-training · global constraint · off-policy updates · trust region

The pith

Entropy ratio clipping imposes a bidirectional global constraint that stabilizes reinforcement learning updates beyond local PPO clipping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that off-policy training in large language models creates distribution shifts that drive policy entropy fluctuations and gradient instability because standard PPO clipping only constrains sampled actions. It introduces the entropy ratio between current and previous policies as a metric that quantifies overall exploration change across the full action distribution. By applying bidirectional clipping to this ratio, the method creates a soft global constraint that keeps successive policies from drifting too far in their uncertainty levels. A sympathetic reader would care because more stable global updates could reduce training collapses during alignment and capability improvement without extra sampling costs.
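
To make the central object concrete, here is a minimal sketch of the entropy-ratio signal as described above: Shannon entropy over the full action distribution, with the ratio between successive policies clipped to a bidirectional band. The function names and the [0.9, 1.1] band are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the entropy-ratio signal (illustrative, not the
# paper's exact formulation). Entropy is taken over the FULL action
# distribution, so the signal also sees unsampled actions.
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution, in nats."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def clipped_entropy_ratio(p_new, p_old, low=0.9, high=1.1, eps=1e-12):
    """rho = H(pi_new) / H(pi_old), clipped to [low, high].
    The band [0.9, 1.1] is an assumed placeholder."""
    rho = entropy(p_new) / (entropy(p_old) + eps)
    return np.clip(rho, low, high)

# Toy example with softmax policies over a small vocabulary.
rng = np.random.default_rng(0)
z_old = rng.normal(size=8)
z_new = z_old + 0.3 * rng.normal(size=8)
p_old = np.exp(z_old) / np.exp(z_old).sum()
p_new = np.exp(z_new) / np.exp(z_new).sum()
print(clipped_entropy_ratio(p_new, p_old))
```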

Core claim

We introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms, and experiments across multiple benchmarks show that ERC consistently improves performance.

What carries the argument

The Entropy Ratio Clipping (ERC) mechanism, which applies upper and lower bounds to the ratio of policy entropies to enforce global distributional stability.
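
The page does not reproduce the paper's objective, but the Figure 1 caption below indicates that ERC adds a clipping term on the entropy ratio ρ_{i,t} to DAPO's PPO-clip surrogate. One plausible reading, sketched with assumed epsilon values and an assumed gating-style combination (the name erc_ppo_loss is hypothetical):

```python
# Hedged sketch: a PPO-clip surrogate with an additional bidirectional
# clip on the entropy ratio. The combination rule (gating out-of-band
# positions) and both epsilons are assumptions, not the paper's objective.
import torch

def erc_ppo_loss(logp_new, logp_old, adv, probs_new, probs_old,
                 eps_local=0.2, eps_global=0.1):
    # Standard PPO-clip term on sampled-token importance ratios.
    ratio = torch.exp(logp_new - logp_old)
    surr = torch.min(ratio * adv,
                     torch.clamp(ratio, 1 - eps_local, 1 + eps_local) * adv)

    # Global signal: entropy ratio over the full action distribution.
    H = lambda p: -(p * p.clamp_min(1e-12).log()).sum(-1)
    rho = H(probs_new) / H(probs_old).clamp_min(1e-12)

    # Bidirectional clip, read as gating: positions whose entropy ratio
    # left the band contribute no gradient, mirroring how PPO-clip
    # removes gradient beyond the local band.
    in_band = ((rho > 1 - eps_global) & (rho < 1 + eps_global)).float()
    return -(surr * in_band).mean()
```

Under this gating reading the constraint is soft: nothing projects the policy back into the band; out-of-band updates simply stop being reinforced.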

If this is right

  • Integrating ERC into DAPO and GPPO produces consistent performance gains on the evaluated benchmarks.
  • Global policy updates remain inside a soft trust region even when unsampled actions would otherwise shift freely.
  • Policy entropy values exhibit fewer large swings, which reduces associated gradient instability.
  • The same clipping rule can be added to other off-policy algorithms that rely on importance sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • ERC might combine usefully with per-token or per-sequence clipping rules to create layered constraints.
  • The method could be tested in non-LLM RL domains where global entropy control matters for exploration-exploitation balance.
  • If the ratio clipping proves robust, it might reduce the need for frequent policy resets or heavy regularization in long training runs.

Load-bearing premise

The entropy ratio between the current and previous policies is an effective global metric of relative change in policy exploration, and constraining it improves stability without creating new failure modes.

What would settle it

A set of training runs in which entropy ratios are kept within the clipped bounds yet entropy fluctuations, gradient norms, or final task performance still degrade relative to the baseline PPO runs.
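
A minimal shape for that instrumentation, assuming a standard PyTorch training loop (helper names are hypothetical): log entropy, the entropy ratio, and gradient norm each step, then check whether runs that stay inside the clip band still degrade.

```python
# Sketch of the instrumentation for the settling test (assumed
# scaffolding; integrate into whichever training stack is in use).
import torch

@torch.no_grad()
def entropy_stats(probs_new, probs_old, eps=1e-12):
    H = lambda p: -(p * p.clamp_min(eps).log()).sum(-1).mean()
    h_new, h_old = H(probs_new), H(probs_old)
    return {"entropy": h_new.item(),
            "entropy_ratio": (h_new / h_old.clamp_min(eps)).item()}

def total_grad_norm(model):
    # Call after loss.backward(): L2 norm over all parameter gradients.
    norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    return torch.stack(norms).norm().item() if norms else 0.0
```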

Figures

Figures reproduced from arXiv: 2512.05591 by Guorui Zhou, Kun Gai, Leiyu Pan, Minxuan Lv, Ruiming Tang, Tiehua Mei, Wenping Hu, Yuntao Li, Zhenpeng Su, Zijia Lin.

Figure 1
Figure 1: (a) Scatter plot showing the relationship between token-wise sampling probability and entropy ratio during RL training. (b) Comparison of the optimization objectives for DAPO and DAPO augmented with ERC. ERC extends the standard PPO-clip objective in DAPO by introducing an additional clipping term on the entropy ratio ρ_{i,t}, thereby enforcing a global distribution-level constraint. (c) Comparison of the …
Figure 2
Figure 2: Training dynamics of entropy, gradient norm, and benchmark accuracy on DeepSeek-R1-Distill-Qwen-7B, comparing baseline methods with and without the proposed ERC mechanism.
Figure 3
Figure 3: Visualization of the clipping regions. Red …
Figure 5
Figure 5: Word cloud visualization of tokens unclipped …
Figure 6
Figure 6: Performance comparison of ERC and KL-regularized methods with varying coefficients. All methods are trained on the DS-R1-Distill-Qwen-7B model.
Figure 7
Figure 7: Performance comparison of ERC with entropy-regularized methods using different regularization coefficients. All methods are trained on the DS-R1-Distill-Qwen-1.5B model. Panels: (a) AIME24 accuracy and (b) AIME25 accuracy for GSPO vs. ERC-DAPO over training steps 0–500.
Figure 8
Figure 8: Performance comparison of ERC with the sequence-level clipping method (Zheng et al., 2025). All methods are trained on the DS-R1-Distill-Qwen-7B model.
read the original abstract

Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Entropy Ratio Clipping (ERC) as a bidirectional soft constraint on the scalar ratio of policy entropies H(π_t)/H(π_{t-1}) between successive policies. This is integrated into DAPO and GPPO to stabilize off-policy RL updates for LLMs by addressing global distributional shifts and compensating for PPO-clip's inability to control probability changes on unsampled actions. Experiments across benchmarks are claimed to show consistent performance improvements.

Significance. If the central mechanism can be rigorously shown to bound per-action shifts on unsampled tokens, ERC would provide a practical global regularizer for high-dimensional policy spaces common in LLM post-training. This could meaningfully reduce entropy fluctuations and gradient instability without adding many free parameters, complementing existing clipping techniques.

major comments (2)
  1. [ERC mechanism description] The manuscript does not derive or prove an inequality relating the clipped entropy ratio to the maximum |log(π_new(a)/π_old(a))| over unsampled actions a. In vocabularies with |A| > 10^4 it remains possible for the aggregate entropy ratio to stay inside the bidirectional clip bounds while large probability mass shifts occur on individual unsampled tokens, undermining the claim that ERC compensates for PPO-clip's local nature.
  2. [Experiments] The experimental section provides no quantitative results, error bars, ablation controls on the clipping thresholds, or statistical significance tests for the claimed improvements when ERC is added to DAPO and GPPO. Without these, the central performance claim cannot be evaluated.
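
A quick numerical illustration of major comment 1; the vocabulary size and probabilities are arbitrary choices for the demonstration, not values from the paper:

```python
# The aggregate entropy ratio sits essentially at 1 while a single
# (unsampled) token's probability moves by five orders of magnitude.
import numpy as np

V = 50_000

def dist(p0):
    """Uniform over the vocabulary except for token 0 at probability p0."""
    p = np.full(V, (1.0 - p0) / (V - 1))
    p[0] = p0
    return p

p_old, p_new = dist(1e-8), dist(1e-3)
H = lambda p: -(p * np.log(p)).sum()
print("entropy ratio:", H(p_new) / H(p_old))            # ~0.9997
print("|log prob shift| on token 0:",
      abs(np.log(p_new[0] / p_old[0])))                  # ~11.5 nats
```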
minor comments (2)
  1. Notation for the entropy ratio should be defined explicitly with the base of the logarithm and the exact summation set (all actions or support of the current policy).
  2. [Abstract] The abstract and introduction would benefit from a short statement of the precise clipping rule (e.g., the functional form of the bidirectional bounds).
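
For concreteness, one conventional way to pin down the notation that minor comment 1 asks for (an assumed definition, not a quotation from the paper):

```latex
% Entropy in nats over the full action set (natural logarithm assumed):
H(\pi_\theta) = -\sum_{a \in \mathcal{A}} \pi_\theta(a \mid s)\,\ln \pi_\theta(a \mid s),
\qquad
\rho_t = \frac{H(\pi_{\theta_t})}{H(\pi_{\theta_{t-1}})},
\qquad
\rho_t \mapsto \operatorname{clip}\!\left(\rho_t,\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\right).
```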

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and describe the changes we will make in the revised version.

read point-by-point responses
  1. Referee: [ERC mechanism description] The manuscript does not derive or prove an inequality relating the clipped entropy ratio to the maximum |log(π_new(a)/π_old(a))| over unsampled actions a. In vocabularies with |A| > 10^4 it remains possible for the aggregate entropy ratio to stay inside the bidirectional clip bounds while large probability mass shifts occur on individual unsampled tokens, undermining the claim that ERC compensates for PPO-clip's local nature.

    Authors: We agree that a formal inequality bounding the maximum per-action probability shift via the entropy ratio is not derived in the current manuscript. ERC provides a global constraint on the ratio of entropies, which serves as a proxy for the overall change in the policy's distribution and exploration behavior. This is particularly relevant for high-dimensional action spaces in LLMs where local clipping alone may not suffice. However, we acknowledge that this does not guarantee bounds on individual unsampled tokens. In the revision, we will add a theoretical discussion section that analyzes the relationship between entropy ratio clipping and distributional shifts, including potential limitations in large vocabularies. revision: partial

  2. Referee: [Experiments] The experimental section provides no quantitative results, error bars, ablation controls on the clipping thresholds, or statistical significance tests for the claimed improvements when ERC is added to DAPO and GPPO. Without these, the central performance claim cannot be evaluated.

    Authors: The manuscript reports performance improvements across benchmarks when integrating ERC into DAPO and GPPO. We recognize that the experimental presentation would benefit from more detailed quantitative reporting. We will revise the experimental section to include error bars, ablation studies on the clipping thresholds (e.g., different ratio bounds), and statistical significance tests to better support the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: ERC is introduced as a new definitional constraint on an observable ratio, without reduction to fitted inputs or self-citations.

full rationale

The paper proposes the entropy ratio as a global metric and defines ERC as bidirectional clipping on that ratio to address PPO-Clip limitations. No equations or steps reduce the claimed stabilization property to a fitted parameter, self-referential definition, or load-bearing self-citation. The derivation chain is a proposal of a new mechanism rather than a closed loop that forces the result by construction. This is the expected non-circular outcome for a methods paper presenting a novel constraint.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly assumes entropy ratio is a faithful global proxy for distributional shift.

pith-pipeline@v0.9.0 · 5494 in / 1016 out tokens · 32818 ms · 2026-05-17T00:49:23.669604+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 11 internal anchors

  1. [1]

    Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, and 80 others. 2025a. https://doi.org/10.48550/ARXIV.2504.13914 Seed1.5-thinking: Advancing superb reasoning models with reinforc...

  2. [2]

    Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2025b. https://doi.org/10.48550/ARXIV.2505.16400 Acereason-nemotron: Advancing math and code reasoning through reinforcement learning . CoRR, abs/2505.16400

  3. [3]

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. 2025. https://doi.org/10.48550/ARXIV.2506.14758 Reasoning with exploration: An entropy perspective . CoRR, abs/2506.14758

  4. [4]

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. 2025. https://doi.org/10.48550/ARXIV.2505.22617 The entropy mechanism of reinforcement learning for reasoning language models . CoRR, abs/2505.22617

  5. [5]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 175 others. 2025. https://doi.org/10.1038/S41586-025-09422-Z Deepseek-r1 incentivizes reasoning in llms through reinforcement lear...

  6. [6]

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. https://doi.org/10.18653/V1/2024.ACL-LONG.211 Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems . In Proceeding...

  7. [7]

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. 2025. https://doi.org/10.48550/ARXIV.2505.22312 Skywork open reasoner 1 technical report . CoRR, abs/2505.22312

  8. [8]

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, and 4 others. 2024. https://doi.org/10.48550/ARXIV.2411.15124 Tülu 3: Pushing frontiers in open language model post-training . CoRR, abs/2411.15124

  9. [9]

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. 2024. Numinamath. https://huggingface.co/AI-MO/NuminaMath-CoT

  10. [10]

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. https://openreview.net/forum?id=v8L0pN6EOi Let's verify step by step . In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net

  11. [11]

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. 2025. https://doi.org/10.48550/ARXIV.2505.24864 Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models . CoRR, abs/2505.24864

  12. [12]

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. Notion Blog

  13. [13]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. http://papers.nips.cc/paper_files/paper/2022/hash/b1efd...

  14. [14]

    John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. 2015. http://proceedings.mlr.press/v37/schulman15.html Trust region policy optimization . In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 , volume 37 of JMLR Workshop and Conference Proceedings , pages 1889-...

  15. [15]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://arxiv.org/abs/1707.06347 Proximal policy optimization algorithms . CoRR, abs/1707.06347

  16. [16]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://doi.org/10.48550/ARXIV.2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models . CoRR, abs/2402.03300

  17. [17]

    Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. 2025a. https://doi.org/10.48550/ARXIV.2508.07629 Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization . CoRR, abs/2508.07629

  18. [18]

    Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. 2025b. https://doi.org/10.48550/ARXIV.2509.20712 CE-GPPO: coordinating entropy via gradient-preserving clipping policy optimization in reinforcement learning . CoRR, abs/2509.20712

  19. [19]

    Ling Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, and 27 others. 2025. https://doi.org/10.48550/ARXIV.2506.14731 Ring-lite: Scalable reasoning via c3po-stabilized reinforcement learning for llms . Co...

  20. [20]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 40 others. 2025. https://doi.org/10.48550/ARXIV.2505.09388 Qwen3 technical report . CoRR, abs/2505.09388

  21. [21]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, and 1 others. 2024. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122

  22. [22]

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, and 16 others. 2025. https://doi.org/10.48550/ARXIV.2503.14476 DAPO: an open-source LLM reinforcement learning system at scale . ...

  23. [23]

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. 2025. https://doi.org/10.48550/ARXIV.2507.18071 Group sequence policy optimization . CoRR, abs/2507.18071
