
arxiv: 2510.10150 · v4 · submitted 2025-10-11 · 💻 cs.LG · cs.AI

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

Pith reviewed 2026-05-18 07:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords RLVR · entropy collapse · large language models · reinforcement learning · entropy modulation · token reweighting · mathematical reasoning

The pith

Token-level entropy change during RLVR updates is governed by four factors that existing interventions overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a tight analytical approximation for how token entropy changes after each policy update in reinforcement learning with verifiable rewards. The approximation isolates four governing factors that together determine the direction and magnitude of entropy shift at every token. A reader would care because rapid entropy collapse restricts exploration and thereby caps the reasoning improvements RLVR is intended to produce. The work then shows that recent heuristic interventions adjust only one or two of these factors and therefore cannot fully control the dynamics, motivating a new method that reweights tokens according to the estimated entropy change at each step.

Core claim

We derive a tight analytical approximation for token-level entropy change at each update step, revealing four governing factors and providing a unified theoretical framework to explain how existing methods influence entropy. This framework reveals a fundamental limitation of recent approaches: they rely on heuristic adjustments to one or two of these factors, leaving other relevant factors unconsidered, thus inherently limiting their effectiveness. Motivated by these findings, we propose STEER, a principled entropy-modulation method that adaptively reweights tokens based on theoretically-estimated entropy variations.

What carries the argument

tight analytical approximation for token-level entropy change at each update step that isolates four governing factors

If this is right

  • Existing heuristic entropy interventions remain limited because they address only a subset of the four factors.
  • STEER improves entropy control by adaptively reweighting tokens according to the full set of estimated variations.
  • Better entropy maintenance produces higher performance on mathematical reasoning and coding benchmarks.
  • The four-factor framework supplies a systematic way to evaluate and design future entropy-modulation techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same approximation could be checked in RL settings that use non-verifiable rewards to test whether the four factors generalize.
  • If the approximation stays accurate at larger model scales, it could inform entropy management in other alignment methods beyond RLVR.
  • Similar closed-form entropy-change derivations might be obtainable for alternative policy-gradient estimators.

Load-bearing premise

The derived approximation for entropy change is sufficiently tight and the four identified factors comprehensively capture the relevant dynamics without significant omitted terms or interactions.

What would settle it

Direct computation of observed token entropy change on a held-out set of RLVR updates that deviates substantially from the four-factor prediction would falsify the central approximation.
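A minimal sketch of such a check, assuming access to the logits at one token position before and after an update. It computes the observed entropy change and flags a prediction that has the wrong sign or misses by more than a relative tolerance. The function names, the tolerance `rel_tol`, and the toy logits are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def observed_entropy_change(logits_before, logits_after):
    """Entropy after the update minus entropy before it."""
    return entropy(softmax(logits_after)) - entropy(softmax(logits_before))

def deviates(observed, predicted, rel_tol=0.5, abs_tol=1e-9):
    """True if the prediction has the wrong sign or misses by more than rel_tol.

    rel_tol and abs_tol are hypothetical thresholds chosen for illustration.
    """
    if abs(observed) <= abs_tol and abs(predicted) <= abs_tol:
        return False
    if observed * predicted < 0:
        return True
    return abs(observed - predicted) > rel_tol * abs(observed)

# Toy example (assumed values): a uniform distribution becomes peaked.
before = [0.0, 0.0, 0.0, 0.0]
after = [2.0, 0.0, 0.0, 0.0]
delta_h = observed_entropy_change(before, after)
print(f"observed entropy change: {delta_h:+.4f}")
print("prediction with flipped sign deviates:", deviates(delta_h, -delta_h))
```

In a real evaluation the `predicted` argument would come from the paper's four-factor estimate on held-out updates; here it is only a placeholder.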

Figures

Figures reproduced from arXiv: 2510.10150 by Can Wang, Hande Dong, Haoyang Liu, Hong Wang, Jian Luo, Jiarui Yu, Jiawei Chen, Qiang Lin, Zhezheng Hao.

Figure 1: Entropy change estimation in the first 10 train
Figure 3: Token-level entropy change indicator $\delta(a|s)$. §3.2 On Analysis of Phenomena in Entropy Dynamics, §3.2.1 Entropy Dynamics under Advantage and Probability: To dissect the factors governing token-level entropy change, we first need to decompose the first-order estimation $\Omega_{i,t}$ from Theorem 1. To this end, we define a token-level entropy change indicator $\delta(a|s)$ as: $\delta(a|s) = -\pi_\theta(a|s)\,(1 - \pi_\theta(a|s))^2\,(\log \pi_\theta(a|s) + H(…$
Figure 5: Key Considerations in Current Approaches. This allows us to express the entropy change from Theorem 1 as $\Omega_{i,t} = \eta\,\mathbb{E}_{a \sim \pi_{\text{old}}(\cdot|s_{i,t})}\left[\frac{I_{\text{clip}}\,A(a|s_{i,t})}{\pi_{\text{old}}(a|s_{i,t})} \cdot \delta(a|s_{i,t})\right]$ (9). The key insight is that $\delta(a|s)$ represents the intrinsic directional tendency of the entropy change, since it depends only on the token's generation probability $\pi_\theta(a|s)$ and the current conditional entropy $H(\cdot|s)$.
Figure 6: Four schemes to uplift entropy based on advantage and probability. To validate these theoretical findings, we conduct an experiment to provide empirical support. Specifically, based on the above analyses, we can learn that entropy increases in two of these quadrants: (Quadrant II) when updating on low-probability tokens with positive advantages, and (Quadrant IV) when updating on high-probability toke…
Figure 8: PSR-NSR
Figure 9: The average clip counts over the first 10 steps. To verify our predictions, we conducted two experiments. First, we confirmed that clipping is indeed concentrated on low-probability tokens, as shown by the trigger counts in
Figure 10: Empirical correlation between current entropy and entropy change. Panels: (a) Math-7B on DAPO-17k, (b) Math-7B on Math. Axes: Step (x), Entropy (y). Curves: GTPO, Entro Adv, GRPO.
Figure 12: Test set accuracy dynamics comparison with benchmarks. Axes: Step (x), Accuracy (y). Curves: min = 0.5, min = 0.6, min = 0.7, min = 0.8.
Figure 15: Relationship between mean token weight and entropy change across steps. Entropy Control: The strength of our method is reflected not only in its performance but also in its ability to regulate entropy across a wide range. We consider an extreme training setup with $\varepsilon_{\text{high}} = 5$ and $\varepsilon_{\text{low}} = 0.99$, where almost no ratio clipping is applied. In such scenarios, RL training
Figure 16: Weight Mapping. Bar chart of accuracy (%) on AIME24, AIME25, AMC23, MATH500, Minerva, Olympiad. Legend: min = 0.8, min = 0.7, min = 0.6, min = 0.5. Extracted bar values, six per series: 33.5 15.6 71.1 81.4 39.6 41.0; 36.9 16.2 72.2 82.4 41.7 43.3; 37.0 16.3 76.3 82.2 39.3 43.5; 34.8 14.8 71.3 82.2 37.7 41.4.
Figure 18: Entropy Change on DAPO-Math-17k. Panels: (a) Ours on Math-1.5B, (b) Ours on 7B, (c) Ours on Math-7B, … y-axis: Ground-Truth Entropy Change.
Figure 19: Entropy Change scatters on DAPO-Math-17k.
Figure 20: Entropy Change on Math. Panels: (a) Ours on Math-1.5B, (b) Ours on 7B, (c) Ours on Math-7B, … y-axis: Ground-Truth Entropy Change.
Figure 21: Entropy Change scatters on Math.
Figure 22 shows interventions applied to each quadrant with the goal of increasing entropy, using standard GRPO ($\varepsilon_{\text{high}} = 0.2$, $\varepsilon_{\text{low}} = 0.2$) as the baseline; while
Figure 23: Decreasing entropy in four cases. … compared to other entropy intervention methods and achieves the highest accuracy across all test sets
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) serves as a cornerstone technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, its training is often plagued by entropy collapse, a rapid decline in policy entropy that limits exploration and undermines training effectiveness. While recent works attempt to mitigate this issue via several heuristic entropy interventions, the underlying mechanisms remain poorly understood. In this work, we conduct comprehensive theoretical and empirical analyses of entropy dynamics in RLVR, offering two main insights: (1) We derive a tight analytical approximation for token-level entropy change at each update step, revealing four governing factors and providing a unified theoretical framework to explain how existing methods influence entropy; (2) We reveal a fundamental limitation of recent approaches: they rely on heuristic adjustments to one or two of these factors, leaving other relevant factors unconsidered, thus inherently limiting their effectiveness. Motivated by these findings, we propose STEER, a principled entropy-modulation method that adaptively reweights tokens based on theoretically estimated entropy variations. Extensive experiments across six mathematical reasoning and three coding benchmarks demonstrate that STEER effectively mitigates entropy collapse and consistently outperforms state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes entropy collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs. It derives a tight analytical approximation for token-level entropy change at each update step, identifying four governing factors that provide a unified framework for how existing entropy interventions work; it then shows that prior methods heuristically adjust only one or two factors and proposes STEER, which adaptively reweights tokens according to the estimated entropy variations. Experiments across six mathematical reasoning and three coding benchmarks report that STEER mitigates entropy collapse and outperforms baselines.

Significance. If the approximation is tight and the four factors are comprehensive, the work supplies a principled theoretical lens on entropy dynamics that could guide future RLVR interventions. The empirical results on nine benchmarks (six math, three coding) provide concrete evidence of practical gains. The analytical derivation itself is a positive contribution when accompanied by validation.

major comments (2)
  1. [§3] §3 (theoretical derivation): the claimed tight analytical approximation for token-level ΔH appears to rest on a first-order expansion around the current policy. Without explicit remainder terms, error bounds, or analysis of quadratic/cross terms between the reward signal and entropy gradient, it is unclear whether the four factors remain exhaustive when per-token gradient norms are large, as is common in RLVR. This directly affects the central claim that the framework explains limitations of prior methods.
  2. [§4.2] §4.2 (factor validation): the paper states that the four factors comprehensively capture the dynamics, yet the text does not report a quantitative check (e.g., residual error after accounting for the four terms or ablation of omitted interactions). If higher-order terms alter sign or magnitude of predicted entropy change, the unified explanation for why heuristic interventions are incomplete would need revision.
minor comments (2)
  1. [§3] Notation for the four factors should be introduced with a single summary table or equation block so readers can track them across the theoretical and empirical sections.
  2. [Experiments] Figure 2 and Figure 3: axis labels and legends should explicitly state whether entropy is measured at token or sequence level and whether curves are averaged over the same set of prompts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. We address each major comment below with clarifications on the derivation and validation of the entropy change approximation. We are prepared to incorporate additional analysis in a revised version where appropriate.

Point-by-point responses
  1. Referee: [§3] §3 (theoretical derivation): the claimed tight analytical approximation for token-level ΔH appears to rest on a first-order expansion around the current policy. Without explicit remainder terms, error bounds, or analysis of quadratic/cross terms between the reward signal and entropy gradient, it is unclear whether the four factors remain exhaustive when per-token gradient norms are large, as is common in RLVR. This directly affects the central claim that the framework explains limitations of prior methods.

    Authors: The derivation in §3 is based on a first-order Taylor expansion of the entropy function with respect to the policy parameters, which is a standard technique for analyzing incremental updates in policy gradient methods. Under the small learning rates and per-step policy shifts typical in RLVR, this yields a tight approximation whose leading terms directly produce the four governing factors. While higher-order terms (quadratic and cross terms) exist in principle, they are second-order in the update magnitude and do not alter the qualitative identification of the dominant factors that prior heuristic methods overlook. We will add an explicit discussion of the remainder term and the regime in which the approximation remains accurate (including when gradient norms become large) to the revised manuscript. revision: partial

  2. Referee: [§4.2] §4.2 (factor validation): the paper states that the four factors comprehensively capture the dynamics, yet the text does not report a quantitative check (e.g., residual error after accounting for the four terms or ablation of omitted interactions). If higher-order terms alter sign or magnitude of predicted entropy change, the unified explanation for why heuristic interventions are incomplete would need revision.

    Authors: We agree that a direct quantitative assessment of approximation error would further strengthen the validation. The current experiments demonstrate that interventions guided by the four factors (via STEER) measurably reduce entropy collapse and improve benchmark performance, providing indirect support for the framework. In the revision we will add a quantitative residual analysis comparing observed per-token entropy changes against the four-factor prediction, together with an ablation that isolates the contribution of potential omitted interaction terms. This will allow readers to evaluate the practical tightness of the approximation under realistic RLVR gradient norms. revision: yes
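The residual analysis described in the second response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' procedure: `residual_summary` and its three statistics (mean absolute residual, sign agreement, relative error) are hypothetical choices, and the numbers in the toy call are made up for illustration.

```python
import math

def residual_summary(observed, predicted):
    """Summarize how well predicted per-token entropy changes match observed ones.

    Both arguments are equal-length sequences of floats, one entry per token.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal length")
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    abs_sum = sum(abs(r) for r in residuals)
    # Fraction of tokens where prediction and observation share a strict sign.
    sign_agreement = sum(1 for o, p in zip(observed, predicted) if o * p > 0) / n
    scale = sum(abs(o) for o in observed)
    return {
        "mean_abs_residual": abs_sum / n,
        "sign_agreement": sign_agreement,
        # Total residual relative to total observed change; inf if nothing changed.
        "relative_error": abs_sum / scale if scale > 0 else math.inf,
    }

# Toy data (assumed values, not results from the paper).
observed = [0.10, -0.20, 0.05, -0.01]
predicted = [0.08, -0.25, 0.06, 0.01]
summary = residual_summary(observed, predicted)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Reporting sign agreement separately from magnitude error matters here because the referee's concern is specifically that higher-order terms could flip the sign of the predicted change.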

Circularity Check

0 steps flagged

Derivation of token-level entropy change approximation is self-contained analytical work with no load-bearing circularity

full rationale

The paper presents a first-principles derivation of a tight analytical approximation for token-level entropy change in RLVR, identifying four governing factors directly from the policy update dynamics. This framework is used to analyze prior heuristic methods and to motivate the STEER reweighting, but the derivation itself does not reduce to fitted parameters renamed as predictions, self-citations that are load-bearing, or any self-definitional loop. The central claims remain independent of the proposed method and are grounded in standard RL entropy expressions without circular reduction to the authors' own inputs or prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on an unverified analytical approximation whose tightness is asserted but not demonstrated in the provided abstract; no explicit free parameters or invented entities are named.

axioms (1)
  • domain assumption Token-level entropy change admits a tight analytical approximation governed by four identifiable factors.
    Stated as the first main insight derived in the work.



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.

  2. Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

    cs.CL 2026-05 unverdicted novelty 6.0

    Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

  3. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  4. Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

    cs.LG 2026-02 unverdicted novelty 6.0

    Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

  5. Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

    cs.LG 2026-05 unverdicted novelty 5.0

    Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 5 Pith papers · 17 internal anchors

  1. [1]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740,

  2. [2]

    Reasoning with Exploration: An Entropy Perspective

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758,

  3. [3]

    GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

    Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. GPG: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546,

  4. [4]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456, 2025a. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechan...

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

  6. [6]

    On-Policy RL with Optimal Reward Baseline

    Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, and Furu Wei. On-policy RL with optimal reward baseline. arXiv preprint arXiv:2505.23585,

  7. [7]

    Rewarding the unlikely: Lifting grpo beyond distribution sharpening, 2025

    Andre He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening.arXiv preprint arXiv:2506.02355, 2025a. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-leve...

  8. [8]

    Skywork Open Reasoner 1 Technical Report

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork open reasoner 1 technical report.arXiv preprint arXiv:2505.22312, 2025b. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical pro...

  9. [9]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290,

  10. [10]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,

  11. [11]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124,

  12. [12]

    CURE: Critical-Token-Guided Re-Concatenation for Entropy-Collapse Prevention

    Qingbin Li, Rongkun Xue, Jie Wang, Ming Zhou, Zhi Li, Xiaofeng Ji, Yongqi Wang, Miao Liu, Zheming Yang, Minghui Qiu, et al. CURE: Critical-token-guided re-concatenation for entropy-collapse prevention. arXiv preprint arXiv:2508.11016,

  13. [13]

    How does RL policy entropy converge during iteration?

    Jiacai Liu. How does RL policy entropy converge during iteration? https://zhuanlan.zhihu.com/p/28476703733, Zhihu Column.

  14. [14]

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864,

  15. [15]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR,

  16. [16]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

  17. [17]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  18. [18]

    Outcome-Based Exploration for LLM Reasoning

    Yuda Song, Julia Kempe, and Remi Munos. Outcome-based exploration for LLM reasoning. arXiv preprint arXiv:2509.06941,

  19. [19]

    GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

    Hongze Tan and Jianfei Pan. GTPO and GRPO-S: Token and sequence-level reward shaping with policy entropy. arXiv preprint arXiv:2508.04349,

  20. [20]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599,

  21. [21]

    Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

    Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, and Wenhu Chen. Emergent hierarchical reasoning in LLMs through reinforcement learning. arXiv preprint arXiv:2509.03646, 2025a. Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, and Guorui Zhou. Stabilizing knowledge, promo...

  22. [22]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, and Rihui Xin. Dcpo: Dynamic clipping policy optimization.arXiv preprint arXiv:2509.02333, 2025b. E...

  23. [23]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

  24. [24]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837,

  25. [25]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,

  26. [26]

    A Survey of Reinforcement Learning for Large Reasoning Models

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025a. Ruipeng Zhang, Ya-Chien Chang, and Sicun Gao. When maximum entropy misleads policy optimization. arXiv preprint arXiv:2506.0561...

  27. [27]

    One typical approach to address entropy collapse is by raising the sampling temperature during inference

    is excluded in our work, since its practical impact is often negligible or counterproductive for reasoning tasks, as demonstrated in recent works (Yu et al., 2025; Chu et al., 2025; Hu et al., 2025). One typical approach to address entropy collapse is by raising the sampling temperature during inference. However, recent findings in (Luo et al.,

  28. [28]

    suggest that while this method postpones the onset of entropy collapse, it does not prevent it, as entropy continues to decrease progressively throughout the training process. Recent studies have sought to mitigate entropy collapse by adjusting key elements of policy optimization, such as PPO-style ratio clipping (Yu et al., 2025; Yang et al., 2025b), bal...

  29. [29]

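    (Yang et al., 2025b): $I_{\text{clip}} = 0$ if $A_{i,t} > 0$ and $r_{i,t} > 1 + \varepsilon_{\text{high}}$; $0$ if $A_{i,t} < 0$ and $r_{i,t} < 1 - \varepsilon_{\text{low}}$; $1$ otherwise.
    KL penalty (Shao et al., 2024): $R(\pi_\theta) = \frac{\pi_{\text{ref}}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})}$
    Entropy Regularization (He et al., 2025b): $R(\pi_\theta) = -\log \pi_\theta(o_{i,t}|q, o_{i,<t})$
    Unlikeliness (He et al., 2025a): $\hat{R}_{i,t} = R_{i,t}\left(1 - \beta_{\text{rank}}\,\frac{G - \text{rank}(o_i)}{G}\right)$, $\beta_{\text{rank}} > 0$
    W-REINFORCE (Zhu et a...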

  30. [30]

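    … $\log \pi_\theta(a|s_{i,t}) \sum_{a' \in \mathcal{A}} \frac{\partial \log \pi_\theta(a|s_{i,t})}{\partial \theta_{s_{i,t},a'}} \left(\theta^{k+1}_{s_{i,t},a'} - \theta^{k}_{s_{i,t},a'}\right) \Big] = -\mathbb{E}_{a \sim \pi_\theta^k(\cdot|s_{i,t})}$ …

    The change of conditional entropy between two update steps is defined as $\Delta H_{i,t} \triangleq H(\pi_\theta^{k+1}|s_{i,t}) - H(\pi_\theta^{k}|s_{i,t})$. Then the first-order estimation of $\Delta H_{i,t}$ in Eq. 2 is $\Omega_{i,t} = -\eta\,\mathbb{E}_{a \sim \pi_\theta^k(\cdot|s_{i,t})}\left[ w_{i,t}\,(1 - \pi_\theta^k(a|s_{i,t}))^2\,\left(\log \pi_\theta^k(a|s_{i,t}) + H(\pi_\theta^k|s_{i,t})\right) \right]$ (14), where $\eta$ is the learning rate and $w_{i,t} = I_{\text{clip}}\,r_{i,t}\,A_{i,t}$ is the per-token weight.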
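The clip indicator quoted in entry [29] and the first-order estimate quoted as Eq. 14 in entry [30] can be evaluated for a single token as in the sketch below. It computes only the quantity inside the expectation, for one assumed next-token distribution. The function names, the toy distribution `probs`, and the learning rate `eta=1e-3` are illustrative assumptions; the default clip range of 0.2 follows the standard GRPO setting mentioned with Figure 22.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def clip_indicator(advantage, ratio, eps_low=0.2, eps_high=0.2):
    """I_clip: 0 when the PPO-style clip would zero the gradient, else 1."""
    if advantage > 0 and ratio > 1 + eps_high:
        return 0.0
    if advantage < 0 and ratio < 1 - eps_low:
        return 0.0
    return 1.0

def entropy_change_term(probs, token, advantage, ratio=1.0, eta=1e-3):
    """Per-token term of Eq. 14: -eta * w * (1 - pi)^2 * (log pi + H).

    w = I_clip * r * A is the per-token weight; eta is an assumed learning rate.
    """
    weight = clip_indicator(advantage, ratio) * ratio * advantage
    pi = probs[token]
    return -eta * weight * (1.0 - pi) ** 2 * (math.log(pi) + entropy(probs))

# Toy distribution (assumed): token 0 is high-probability, token 2 is low-probability.
probs = [0.7, 0.2, 0.1]
for token, label in ((0, "high-prob"), (2, "low-prob")):
    for advantage in (1.0, -1.0):
        term = entropy_change_term(probs, token, advantage)
        print(f"{label:9s} A={advantage:+.0f} term={term:+.2e}")
```

With these toy numbers the term is positive for the low-probability token with a positive advantage and for the high-probability token with a negative advantage, and negative in the other two cases. A clipped token (for example a positive advantage with ratio 1.5) contributes zero.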