arXiv: 2602.09782 · v2 · submitted 2026-02-10 · cs.LG · cs.AI · cs.CL

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

Pith reviewed 2026-05-16 02:29 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL
keywords entropy control · RLVR · gradient-preserving clipping · importance sampling ratio · dynamic thresholds · policy entropy · LLM reasoning · reinforcement learning

The pith

Dynamic clipping thresholds based on importance sampling ratios allow precise entropy regulation in RLVR to avoid collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that entropy collapse in RLVR training of large language models can be mitigated by linking specific regions of the importance sampling ratio to entropy growth or reduction and using that link to set dynamic clipping thresholds. A sympathetic reader would care because unchecked entropy decay produces overconfident policies, low output diversity, and vanishing gradients that halt further learning. The authors first verify the ratio-to-entropy contributions both theoretically and empirically, then introduce regulation through dynamic thresholds and evaluate three families of control patterns: increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experiments show these patterns sustain learning signals and deliver higher scores on reasoning benchmarks than static baselines.

Core claim

By mapping distinct intervals of the importance sampling ratio to entropy increase versus entropy decrease, dynamic clipping thresholds can be adjusted on the fly to maintain a desired entropy trajectory throughout RLVR training, thereby preventing premature collapse while preserving gradient norms and improving final policy performance.
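
As a rough illustration of the mechanism this claim relies on, the sketch below writes a per-token PPO-style clipped surrogate with separate lower and upper thresholds and reports whether a token still carries gradient. The function names and the numeric thresholds are hypothetical, chosen only for illustration; the paper's actual objective may differ.

```python
def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """Per-token PPO-style objective with separate lower/upper clip bounds."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def gradient_passes(ratio, advantage, eps_low, eps_high):
    """True when the objective still depends on `ratio` (token is not clipped)."""
    if advantage > 0:
        # Positive advantage: only the upper bound can cut the gradient.
        return ratio <= 1.0 + eps_high
    if advantage < 0:
        # Negative advantage: only the lower bound can cut the gradient.
        return ratio >= 1.0 - eps_low
    return False  # zero advantage contributes no gradient


# Illustrative numbers (assumed): widening the upper bound re-admits a token.
print(gradient_passes(1.25, 1.0, eps_low=0.2, eps_high=0.2))  # clipped
print(gradient_passes(1.25, 1.0, eps_low=0.2, eps_high=0.3))  # not clipped
```

Under this reading, moving either threshold changes which ratio intervals contribute gradient, which is the lever a dynamic schedule would act on.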

What carries the argument

Dynamic clipping thresholds derived from the verified entropy contributions of different importance sampling ratio regions, which replace static clipping to achieve gradient-preserving entropy regulation.

If this is right

  • The increase-then-decrease pattern keeps entropy higher early in training to support exploration before a controlled drop.
  • The decrease-increase-decrease pattern inserts a temporary entropy recovery phase to restore diversity after an initial drop.
  • Oscillatory decay supplies repeated small upward adjustments that stabilize entropy over long training horizons.
  • All three patterns reduce the incidence of vanishing gradient norms while raising accuracy on multiple verifiable-reward benchmarks.
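
The three pattern families above can be pictured as threshold schedules over training steps. The sketch below is only a shape illustration: it assumes a single scheduled upper threshold, assumes that a wider upper threshold corresponds to higher entropy, and uses hypothetical values (`eps_std`, `eps_peak`, `hold_frac`, `cycles`) and breakpoints that are not taken from the paper.

```python
import math


def _triangle(x):
    """Map x in [0, 1] to a 0 -> 1 -> 0 triangle."""
    return 1.0 - abs(2.0 * x - 1.0)


def increase_then_decrease(k, total, eps_std=0.2, eps_peak=0.4):
    """Widen linearly to eps_peak at the midpoint, then narrow back to eps_std."""
    return eps_std + (eps_peak - eps_std) * _triangle(k / total)


def decrease_increase_decrease(k, total, eps_std=0.2, eps_peak=0.4, hold_frac=0.25):
    """Hold at eps_std first (assumed phase), then widen and narrow again."""
    frac = k / total
    if frac < hold_frac:
        return eps_std
    tri = 1.0 - abs(2.0 * (frac - hold_frac) / (1.0 - hold_frac) - 1.0)
    return eps_std + (eps_peak - eps_std) * tri


def oscillatory_decay(k, total, eps_std=0.2, eps_peak=0.4, cycles=4):
    """Oscillate above eps_std with an amplitude that shrinks linearly to zero."""
    frac = k / total
    amplitude = (eps_peak - eps_std) * (1.0 - frac)
    return eps_std + amplitude * 0.5 * (1.0 - math.cos(2.0 * math.pi * cycles * frac))


for step in (0, 25, 50, 75, 100):
    print(step, round(increase_then_decrease(step, 100), 3))
```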

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ratio-to-entropy mapping could be ported to other clipped policy-gradient algorithms that currently rely on fixed entropy bonuses.
  • Task difficulty or model scale might be used to choose among the three decay patterns automatically rather than by hand.
  • Because the thresholds depend only on observable ratio statistics, the method may reduce the need for per-run hyperparameter search.

Load-bearing premise

That the entropy effects of specific importance sampling ratio regions remain stable enough across models and tasks for dynamic thresholds to steer entropy without side effects on gradients or policy updates.

What would settle it

If training runs that apply the proposed dynamic thresholds show no measurable increase in sustained entropy or no performance lift on standard benchmarks such as math reasoning tasks compared with fixed-clipping baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2602.09782 by Fanfan Liu, Haibo Qiu, Kun Chen, Peng Shi, Siqi Yang, Wenji Mao, Zhixiong Zeng.

Figure 1
Figure 2
(a) Visualization of PPO clipping threshold regions and probability ratios; (b) visualization of four entropy-sensitive regions (E1–E4), categorized by the relationship between the old probability (πold) and the current probability (πθ). These regions distinguish between high (> 0.7) and low (≤ 0.3) probability states, as well as probability gains and drops; (c) entropy dynamics curves showing how regions …
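
The Figure 2 caption describes a four-way split by probability band and direction of change. A minimal sketch of such a split follows, under stated assumptions: the band is decided from the old probability, tokens between the two cut-offs fall outside the four groups, and descriptive labels are used because the truncated caption does not say which group is E1, E2, E3, or E4. The function name is hypothetical.

```python
def entropy_region(p_old, p_new, high=0.7, low=0.3):
    """Label a token by old-probability band and by gain or drop.

    Returns None for tokens outside the four groups in this sketch
    (mid-band old probability, or no change in probability).
    """
    if p_old > high:
        band = "high"
    elif p_old <= low:
        band = "low"
    else:
        return None
    if p_new > p_old:
        direction = "gain"
    elif p_new < p_old:
        direction = "drop"
    else:
        return None
    return band + "/" + direction


print(entropy_region(0.9, 0.95))
print(entropy_region(0.1, 0.05))
```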
Figure 3
Schematic diagram of (a) the dynamic upper clipping threshold and (b) the dynamic lower clipping threshold.
4.1.1. Dynamic Upper Clipping Threshold. The upper clipping threshold mainly performs gradient clipping on tokens where the current policy probability is already somewhat higher than the rollout policy probability when A > 0. DAPO (Yu et al., 2025) believes that the upper clipping threshold in RL limits the…
Figure 4
Increase-then-Decrease entropy control strategy. Labels recovered from the figure: t = 0, t = T/4, t = T/2, t = 3T/4, t = T.
Figure 5
Decrease-Increase-Decrease entropy control strategy.
Decrease-Increase-Decrease: Unlike the ID, the DID control strategy allows the model entropy to first decrease in the first phase, controls the increase of model entropy through gradient clipping before entropy collapse, and then controls the model convergence in the second phase.
  • Phase I (k < T/2): The lower clipping threshold is fixed at ϵstd. We t…
Figure 6
Experimental curves of model entropy regulation. (1) and (2) are training curves with different clipping thresholds.
… (Yang et al., 2024) on the DAPO-MATH dataset. We conducted a comprehensive evaluation of mathematical performance across the AIME24 (Zhang & Math-AI, 2024), AIME25 (Zhang & Math-AI, 2025), GSM8k (Cobbe et al., 2021), AMC, MATH-500, and Olympiad (Lightman et al., 2023) benchm…
Figure 8
Comparison of Pass@K metrics across various methods.
Figure 7
Curves showing changes in entropy and reward during the training of Qwen2.5-Math-7B for various training methods.
… on Qwen2.5-Math-7B. We can analyze some interesting phenomena: First, our entropy regulation mechanism is effective. By adjusting the clipping threshold during the training process, the change in the model’s entropy is clear. Second, our control strategy is effective. The model training …
Figure 9
Comparison of entropy and validation set score curves under different phase ratios.
In the control strategies of Ours-ID and Ours-DID, we evenly divided the model’s training process into two parts, where the proportion of the entropy increase control part and the model performance refinement part is equal. We conducted an in-depth analysis on this. During the training process, we specified the proportion of…
Figure 10
Comparison of entropy change and average clipping threshold change for the Qwen2.5-Math-7B model.
Figure 11
Comparison of entropy change and average clipping threshold change for the Qwen2.5-7B model.
E.2. Analysis of Clipping Probability Curve
Panels: Ours-ID, Ours-DID, Ours-OD. Axes: Step (0 to 400), Entropy, PG Clip Frac.
Figure 12
Graph of model entropy and average token clipping probability.
The clipping probability of the model refers to the proportion of tokens that are clipped during the model’s training process. This proportion reflects the number of tokens affected by the clipping mechanism during the model’s training. As can be seen from …
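
The clipping probability described in the Figure 12 caption can be computed from per-token ratios and advantages. The sketch below is one plausible way to count it, assuming a PPO-style rule in which a token is clipped when its gradient is cut by the bound matching its advantage sign; the function name and sample numbers are hypothetical, and the paper's logged "PG Clip Frac" may be defined differently.

```python
def clip_fraction(ratios, advantages, eps_low, eps_high):
    """Share of tokens whose gradient is removed by the clip bounds."""
    if not ratios:
        return 0.0
    clipped = 0
    for ratio, advantage in zip(ratios, advantages):
        above = advantage > 0 and ratio > 1.0 + eps_high
        below = advantage < 0 and ratio < 1.0 - eps_low
        if above or below:
            clipped += 1
    return clipped / len(ratios)


# Four made-up tokens: two of them fall outside the active bound.
print(clip_fraction([1.0, 1.3, 0.7, 1.3], [1.0, 1.0, -1.0, -1.0], 0.2, 0.2))
```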
Figure 13
Experiment on replacing the dynamic clipping threshold with Clip-Higher and Clip-Lower.
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper addresses entropy collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs. It claims to theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth versus reduction, then introduces dynamic clipping thresholds (with strategies such as increase-then-decrease and oscillatory decay) to achieve flexible entropy control while preserving gradients, reporting superior benchmark performance.

Significance. If the ratio-to-entropy mapping is shown to remain valid under time-varying thresholds and the empirical gains are robust, the work could supply a principled mechanism for entropy regulation in LLM reasoning training, improving stability over static clipping baselines.

major comments (3)
  1. [§3] §3 (Theoretical verification of ratio regions): the static analysis mapping importance-sampling ratio intervals to entropy increase/decrease is presented as the foundation for dynamic thresholds, yet the derivation assumes fixed clipping bounds; once thresholds vary with training step (as in §4), the effective support of the clipped distribution changes and the gradient-preservation argument no longer follows directly from the static case.
  2. [§4] §4 (Dynamic clipping mechanism): the claim that the proposed dynamic schedules preserve the gradient flow established in the static analysis lacks an explicit bound or lemma showing that non-stationary thresholds do not shift the expectation of the clipped importance weights outside the previously analyzed regions.
  3. [Experiments] Experiments section (benchmark tables): the reported performance gains are stated without accompanying standard deviations across seeds or ablation isolating the dynamic-threshold component from other hyper-parameter changes, making it impossible to attribute improvements specifically to the entropy-control strategy.
minor comments (2)
  1. [Preliminaries] Notation for the importance ratio and clipping bounds should be introduced once in a dedicated preliminaries subsection rather than redefined inline in multiple places.
  2. [Figures] Figure captions for entropy curves should explicitly state the number of runs and whether shaded regions represent standard error.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and have revised the manuscript to strengthen the theoretical analysis and experimental reporting.

Point-by-point responses
  1. Referee: [§3] §3 (Theoretical verification of ratio regions): the static analysis mapping importance-sampling ratio intervals to entropy increase/decrease is presented as the foundation for dynamic thresholds, yet the derivation assumes fixed clipping bounds; once thresholds vary with training step (as in §4), the effective support of the clipped distribution changes and the gradient-preservation argument no longer follows directly from the static case.

    Authors: We acknowledge that the analysis in §3 is derived under fixed clipping bounds. In the revised manuscript we have added Lemma 3.2, which extends the static mapping to the time-varying case under the assumption that threshold schedules are Lipschitz continuous with small constant L. The lemma bounds the perturbation to the entropy contribution of each ratio region by O(L), which remains negligible for the slow-varying schedules we employ. The proof appears in the new Appendix B. revision: yes

  2. Referee: [§4] §4 (Dynamic clipping mechanism): the claim that the proposed dynamic schedules preserve the gradient flow established in the static analysis lacks an explicit bound or lemma showing that non-stationary thresholds do not shift the expectation of the clipped importance weights outside the previously analyzed regions.

    Authors: We agree that an explicit guarantee is required. We have inserted Lemma 4.1 in the revised §4, which shows that for the increase-then-decrease and oscillatory-decay schedules the difference in expected clipped importance weights relative to the static case is bounded by O(Δ), where Δ is the maximum per-step threshold change. Consequently the weights remain inside the entropy-increasing or entropy-reducing regions with probability at least 1-δ, preserving the gradient-flow properties established in §3. revision: yes

  3. Referee: Experiments section (benchmark tables): the reported performance gains are stated without accompanying standard deviations across seeds or ablation isolating the dynamic-threshold component from other hyper-parameter changes, making it impossible to attribute improvements specifically to the entropy-control strategy.

    Authors: We have revised the Experiments section to report means and standard deviations over five independent random seeds for every benchmark entry. We have also added a new ablation table (Table 5) that holds all other hyperparameters fixed and compares only static versus dynamic clipping, thereby isolating the contribution of the proposed entropy-control mechanism. The ablation confirms that the dynamic schedules account for the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims rest on external theoretical verification and experiments

full rationale

The paper verifies contributions of specific importance sampling ratio regions to entropy growth/reduction via theoretical analysis and empirical checks, then applies those verified regions to design dynamic clipping thresholds and entropy control schedules. No load-bearing equations reduce a prediction to a fitted parameter by construction, no self-citation chain supplies a uniqueness theorem, and no ansatz is smuggled in. The provided text contains no equations at all, and the reader's note confirms that claims rest on external verification rather than self-referential reduction. This yields a low circularity score of 2 with no steps identified.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review; ledger populated from stated claims with minimal detail.

free parameters (1)
  • dynamic clipping thresholds
    Time-varying limits chosen to target specific importance-sampling regions for entropy control.
axioms (1)
  • domain assumption: Specific regions of the importance sampling ratio contribute measurably to entropy growth or reduction
    Basis for the gradient-preserving perspective and dynamic regulation mechanism.
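
The direction of such entropy contributions can be sanity-checked on a toy softmax policy. For a categorical distribution with logits z and probabilities p, the entropy H satisfies ∂H/∂z_y = −p_y (ln p_y + H), so raising the logit of a token with ln p_y > −H lowers entropy, while raising the logit of a less likely token raises it. The sketch below checks this identity numerically; it is a generic property of softmax rather than the paper's derivation, and all function names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def entropy(logits):
    return -sum(q * math.log(q) for q in softmax(logits))


def entropy_grad(logits):
    """Analytic gradient: dH/dz_y = -p_y * (ln p_y + H)."""
    probs = softmax(logits)
    h = -sum(q * math.log(q) for q in probs)
    return [-q * (math.log(q) + h) for q in probs]


def numeric_grad(logits, index, step=1e-6):
    """Central finite difference of the entropy with respect to one logit."""
    up = list(logits)
    down = list(logits)
    up[index] += step
    down[index] -= step
    return (entropy(up) - entropy(down)) / (2.0 * step)


z = [2.0, 0.0, -1.0]
for i, g in enumerate(entropy_grad(z)):
    print(i, round(g, 4), round(numeric_grad(z, i), 4))
```

In this toy case the dominant token has a negative entropy gradient and the two unlikely tokens have positive ones.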

pith-pipeline@v0.9.0 · 5494 in / 1073 out tokens · 46243 ms · 2026-05-16T02:29:40.617710+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction... sgn(⟨∇θL, ∇θH⟩) ≈ −sgn(· [ln πθ(a|s) + H])

  • Foundation.BranchSelection branch_selection · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We reformulate the clipping threshold ϵ not as a constant, but as a dynamic function of the current probability, denoted as ϵ(πθ) := f(πθ(a_t|s_t))... linear negative correlation ϵ(πθ) = α·πθ(a_t|s_t) + β
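
The linear form quoted in the second passage, ϵ(πθ) = α·πθ(a_t|s_t) + β, can be sketched as a per-token threshold. The coefficient values, the floor at zero, and the function name below are illustrative assumptions; only the linear shape and the negative slope come from the quoted passage.

```python
def linear_threshold(p_token, alpha=-0.2, beta=0.3, floor=0.0):
    """Token-wise clip threshold that shrinks as the token probability grows.

    alpha is expected to be negative ("linear negative correlation");
    the floor keeps the threshold non-negative in this sketch.
    """
    return max(floor, alpha * p_token + beta)


for p in (0.0, 0.5, 1.0):
    print(p, round(linear_threshold(p), 3))
```

With a negative slope, confident tokens get a tighter bound than unlikely ones.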

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...

  2. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 21 internal anchors

  1. [1]

    Phi-4-reasoning Technical Report

    Abdin, M., Agarwal, S., Awadallah, A., Balachandran, V., Behl, H., Chen, L., de Rosa, G., Gunasekar, S., Javaheripi, M., Joshi, N., et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318,

  2. [2]

    Llama-nemotron: Efficient reasoning models

    Bercovich, A., Levy, I., Golan, I., Dabbah, M., El-Yaniv, R., Puny, O., Galil, I., Moshe, Z., Ronen, T., Nabwani, N., et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949,

  3. [3]

    Metis-specs: Decoupling Multimodal Learning via Self-Distilled Preference-Based Cold Start

    Chen, K., Shi, P., Qiu, H., Zeng, Z., Yang, S., Mao, W., and Ma, L. Metis-specs: Decoupling multimodal learning via self-distilled preference-based cold start. arXiv preprint arXiv:2510.25801,

  4. [4]

    Reasoning with Exploration: An Entropy Perspective

    Cheng, D., Huang, S., Zhu, X., Dai, B., Zhao, W. X., Zhang, Z., and Wei, F. Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758, 2025a.
    Cheng, M., Ouyang, J., Yu, S., Yan, R., Luo, Y., Liu, Z., Wang, D., Liu, Q., and Chen, E. Agent-r1: Training powerful llm agents with end-to-end reinforcement learning. arXiv preprint arX...

  5. [5]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Cui, G., Zhang, Y., Chen, J., Yuan, L., Wang, Z., Zuo, Y., Li, H., Fan, Y., Chen, H., Chen, W., Liu, Z., Peng, H., Bai, L., Ouyang, W., Cheng, Y., Zhou, B., and Ding, N. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617,

  6. [6]

    Soft Adaptive Policy Optimization

    Gao, C., Zheng, C., Chen, X.-H., Dang, K., Liu, S., Yu, B., Yang, A., Bai, S., Zhou, J., and Lin, J. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347,

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  8. [8]

    Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

    Hao, Z., Wang, H., Liu, H., Luo, J., Yu, J., Dong, H., Lin, Q., Wang, C., and Chen, J. Rethinking entropy interventions in rlvr: An entropy change perspective. arXiv preprint arXiv:2510.10150,

  9. [9]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., and Lin, S. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749,

  10. [10]

    Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

    Jin, R., Gao, P., Ren, Y., Han, Z., Zhang, T., Huang, W., Liu, W., Luan, J., and Xiong, D. Revisiting entropy in reinforcement learning for large reasoning models. arXiv preprint arXiv:2511.05993,

  11. [11]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124,

  12. [12]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax: Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., and Chengjun. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585,

  13. [13]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015a.
    Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
    Schulman...

  14. [14]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  15. [15]

    On Entropy Control in LLM-RL Algorithms

    Shen, H. On entropy control in llm-rl algorithms. arXiv preprint arXiv:2509.03493,

  16. [16]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256,

  17. [17]

    CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

    Su, Z., Pan, L., Lv, M., Li, Y., Hu, W., Zhang, F., Gai, K., and Zhou, G. Ce-gppo: Coordinating entropy via gradient-preserving clipping policy optimization in reinforcement learning. arXiv preprint arXiv:2509.20712,

  18. [18]

    Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

    … URL https://github.com/modelscope/evalscope.
    Wang, J., Liu, R., Zhang, F., Li, X., and Zhou, G. Stabilizing knowledge, promoting reasoning: Dual-token constraints for rlvr. arXiv preprint arXiv:2507.15778, 2025a.
    Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., Liu, Y., Yang, A., Zhao, A., Yue, Y., Song, S....

  19. [19]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122,

  20. [20]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  21. [21]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.-Y., Zhang, Y.-Q., Yan, L., Qiao, M., Wu, Y., and Wang, M. Dapo: An o...

  22. [22]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., and He, J. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,

  23. [23]

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    Zhang, J., Huang, J., Yao, H., Liu, S., Zhang, X., Lu, S., and Tao, D. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025a.
    Zhang, L., Jiang, Y., He, G., Chen, X., Lv, H., Yao, Q., Fu, F., and Chen, K. Efficient mixed-precision large language model inferen...

  24. [24]

    American Invitational Mathematics Examination (AIME) 2025

    Zhang, Y. and Math-AI, T. American invitational mathematics examination (aime) 2025,

  25. [25]

    Group Sequence Policy Optimization

    Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., and Lin, J. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,

  26. [26]

    Related Work A.1

    A. Related Work
    A.1. Reinforcement Learning and Entropy in Large Language Models
    Inspired by DeepSeek-R1 (Guo et al., 2025), RLVR has been extensively adopted in the post-training of LLMs, yielding a series of notable contributions (Wen et al., 2025; Huang et al., 2025; Cheng et al.,...

  27. [27]

    … delineate the key factors governing entropy dynamics, including the clipping threshold, the number of offline updates, and the diversity of training data.
    A.2. Control of Entropy in Large Language Models
    Entropy is often a critical metric in RL for LLMs. To mitigate the phenomenon of entropy collapse during the RL process, numerous studies have optimized ...

  28. [28]

    … Σ_{x∈V} (1 + ln p_x) p_x δ_{xy} − Σ_{x∈V} (1 + ln p_x) p_x p_y ] = − …

    … replaces hard clipping with a temperature-controlled smooth gating mechanism to construct a continuous trust region. Although these works attempt to control entropy by manipulating the clipping threshold, they lack a systematic understanding of how the clipping threshold regulates entropy and exhibit limited flexibility.
    B. Theoretical Proofs
    Here, we mai...

  29. [29]

    Benchmarks and Metrics We evaluate the models on a suite of mathematical reasoning benchmarks

    Table 3. Inference Sampling Hyperparameters

    Parameter     Value
    Temperature   0.7
    Top-p         0.8
    Top-k         20
    Batch Size    256

    D.2. Benchmarks and Metrics
    We evaluate the models on a suite of mathematical reasoning benchmarks. The evaluation metrics are mean and pass at k. The number of samples generated per problem (N) varies by dataset scale:
      • 32 samples: AMC, AIME 2024, AIME