arxiv: 2605.11775 · v2 · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

Jiazheng Zhang , Ziche Fu , Junrui Shen , Yunbin Zhao , Yunke Zhang , Zhiheng Xi , Long Ma , Chenxin An

show 12 more authors

Zhihao Zhang Shichun Liu Dingwei Zhu Shihan Dou Shaofan Liu Han Li Wiggin Zhou Aiden Adams Tao Gui Fei Huang Qi Zhang Xuanjing Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:00 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords entropy polaritypolicy entropyreinforcement fine-tuninglarge language modelsexploration controlRLVRpolicy optimizationtoken-level analysis

0 comments

The pith

Entropy polarity, a signed token-level quantity, predicts whether policy updates expand or contract entropy in reinforcement fine-tuning of language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework for entropy mechanics in reinforcement learning with verifiable rewards for large language models. It derives a first-order approximation of entropy change that produces entropy polarity, a signed measure at each token showing whether an update will increase or decrease overall policy entropy. The work identifies a structural asymmetry in which updates on high-probability tokens drive entropy contraction while expansion typically needs lower-probability tokens. From this foundation the authors introduce Polarity-Aware Policy Optimization, which balances both polarity directions through advantage reweighting and uses observed entropy trajectories to adjust optimization pressure dynamically.

Core claim

In RLVR for LLMs, entropy change admits a first-order approximation that defines entropy polarity, a signed token-level quantity predicting the direction and magnitude of entropy modification by a sampled update. Reinforcing frequent high-probability tokens produces contraction tendencies, whereas expansive tendencies arise mainly from lower-probability samples or stronger distributional correction. This asymmetry implies that positive and negative polarity branches play complementary roles, which Polarity-Aware Policy Optimization exploits by preserving both branches and reallocating pressure adaptively according to the empirical entropy trajectory.

What carries the argument

Entropy polarity: a signed token-level quantity obtained from the first-order approximation of entropy change, which indicates whether a given policy update expands or contracts entropy.

If this is right

Positive-polarity updates preserve exploration by expanding entropy while negative-polarity updates strengthen exploitation by contracting it.
Advantage reweighting that preserves both polarity branches allows simultaneous improvement in reward and training efficiency.
Adaptive reallocation of optimization pressure based on the running entropy trajectory yields consistent gains on mathematical reasoning and agentic tasks.
The polarity framework supplies a token-level signal that can be monitored online to maintain a desired entropy level without external regularizers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Entropy polarity could serve as an online diagnostic to detect and counteract premature entropy collapse in other fine-tuning regimes beyond RLVR.
The observed contraction bias for high-probability tokens may generalize to explain rapid overfitting patterns in non-LLM reinforcement learning.
Combining polarity signals with existing entropy bonuses or KL penalties could produce more stable multi-objective control in policy optimization.
Token-level polarity tracking might enable finer-grained intervention, such as selectively amplifying expansive updates only on reasoning-critical tokens.

Load-bearing premise

The first-order approximation of entropy change accurately captures the dominant mechanism by which sampled policy updates reshape token-level entropy in RLVR for LLMs.

What would settle it

Compute entropy polarity for each token in sampled updates during RLVR training and compare the predicted direction against the actual measured change in token entropy; systematic mismatch between predicted and observed signs would falsify the approximation.

read the original abstract

Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces entropy polarity from a first-order approximation as a token-level control signal and builds PAPO around it, but the approximation's accuracy for real policy shifts remains the main open question.

read the letter

The main takeaway is that this work derives a signed token-level quantity called entropy polarity from a first-order Taylor expansion of entropy change under policy updates, then uses the positive and negative branches to guide advantage reweighting in their PAPO algorithm. They also note an asymmetry where high-probability tokens tend to contract entropy while expansion often needs lower-probability samples or stronger corrections. That local view is the fresh part compared with the global entropy penalties cited in the abstract. They show it correlates with observed entropy trajectories and that PAPO improves reward and efficiency on math reasoning and agent benchmarks over standard baselines. The framework gives a clean story for why token-level signals might beat global objectives when balancing exploration and exploitation in LLM RLVR. The experiments appear to support the practical payoff, at least directionally. The soft spot is the first-order approximation itself. The stress-test concern holds: without bounds on the remainder term or quantification of how large the probability shifts get during training, it is unclear how well the linear term dominates, especially for the low-probability tokens that matter most for expansion. The abstract and reported results mention correlations but do not detail error bars, ablation on the approximation error, or validity conditions for the range of KL shifts they actually see. The adaptive reweighting also leans on the empirical entropy trajectory as an online signal, which works but adds a mild dependence that could be tightened. This paper is aimed at researchers tuning reasoning or agentic LLMs who already care about entropy control. Anyone looking for a principled local mechanism rather than another global regularizer will find the derivation and the PAPO implementation useful to try or extend. It deserves a serious referee because the core idea is new, the empirical gains are shown, and the theoretical framing is coherent even if the approximation needs more work. I would send it to review with a request to strengthen the error analysis on the Taylor step.

Referee Report

3 major / 3 minor

Summary. The paper develops a theoretical framework for entropy mechanics in RLVR for LLMs. It derives a first-order approximation of token-level entropy change under sampled policy updates, introducing entropy polarity as a signed token-level quantity that predicts entropy expansion or contraction. The analysis identifies a structural asymmetry: high-probability tokens induce contraction while expansion requires lower-probability samples or stronger correction. Empirically, polarity is shown to correlate with observed entropy trajectories; the proposed PAPO method uses polarity-aware advantage reweighting with the empirical entropy trajectory as an online signal to balance the two branches, yielding improved performance on mathematical reasoning and agentic benchmarks.

Significance. If the first-order approximation holds with controllable error, the work supplies a token-level mechanistic account of how policy updates reshape entropy, moving beyond global regularization. The asymmetry result and PAPO controller could enable more targeted exploration-exploitation trade-offs in LLM fine-tuning. The empirical correlation and benchmark gains, if statistically robust, would constitute a practical contribution to entropy-aware RLVR methods.

major comments (3)

[§3.2, Eq. (7)] §3.2, Eq. (7): The first-order Taylor expansion for token-level entropy change ΔH_i is stated without the Lagrange remainder or any explicit bound on higher-order terms as a function of the local probability shift |Δπ_i|. No analysis is given for the regime of typical RLVR KL divergences or max-probability shifts where the linear term dominates, which is required for polarity to reliably predict sign and magnitude of entropy change.
[Experiments section, Tables 2–4] Experiments section, Tables 2–4: Reported performance gains for PAPO versus baselines are presented without error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the observed reward and efficiency improvements are stable or could be explained by variance in the RLVR training runs.
[§4.3] §4.3: The adaptive reweighting in PAPO is described as using the empirical entropy trajectory as an online phase signal, yet no ablation is reported that isolates the contribution of polarity-based branching versus simple entropy-target tracking. This leaves open whether the polarity construct itself is load-bearing for the claimed gains.

minor comments (3)

[§3.1] Notation for token-level entropy H_i and polarity P_i is introduced without an explicit comparison table to prior global entropy measures (e.g., those in PPO or GRPO), which would help readers situate the new quantities.
[Figure 3] Figure 3 caption does not state the number of tokens or trajectories aggregated; axis labels use inconsistent font sizes with the main text.
[Abstract and §5] The abstract claims 'substantial reward improvements' but the main text does not quantify the absolute reward deltas or normalize them against the baseline variance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in both the theoretical analysis and empirical validation. We provide point-by-point responses below and commit to making the necessary revisions.

read point-by-point responses

Referee: §3.2, Eq. (7): The first-order Taylor expansion for token-level entropy change ΔH_i is stated without the Lagrange remainder or any explicit bound on higher-order terms as a function of the local probability shift |Δπ_i|. No analysis is given for the regime of typical RLVR KL divergences or max-probability shifts where the linear term dominates, which is required for polarity to reliably predict sign and magnitude of entropy change.

Authors: We agree that an explicit error analysis would strengthen the theoretical foundation. In the revised manuscript, we will derive the Lagrange remainder for the Taylor expansion of the entropy function and provide a bound on the higher-order terms in terms of |Δπ_i|. Furthermore, we will include an analysis of typical RLVR KL divergence values (commonly in the range of 0.01 to 0.05) and demonstrate, both theoretically and via additional experiments, that the first-order term dominates under these conditions, thereby validating the use of entropy polarity for sign prediction. revision: yes
Referee: Experiments section, Tables 2–4: Reported performance gains for PAPO versus baselines are presented without error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the observed reward and efficiency improvements are stable or could be explained by variance in the RLVR training runs.

Authors: This is a valid concern regarding the robustness of our empirical results. We will revise the experiments section to include results from multiple random seeds (at least three), report means with standard deviations as error bars in Tables 2-4, and perform statistical significance tests (e.g., paired t-tests) to confirm that the improvements are statistically significant and not due to training variance. revision: yes
Referee: §4.3: The adaptive reweighting in PAPO is described as using the empirical entropy trajectory as an online phase signal, yet no ablation is reported that isolates the contribution of polarity-based branching versus simple entropy-target tracking. This leaves open whether the polarity construct itself is load-bearing for the claimed gains.

Authors: We appreciate the suggestion to better isolate the effect of polarity. In the revised paper, we will add a new ablation experiment comparing the full PAPO method against a baseline that performs entropy-target tracking without the polarity-based branching mechanism. This will clarify the specific role of the polarity construct in achieving the reported performance gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity in entropy polarity derivation

full rationale

The paper's central derivation obtains entropy polarity from a first-order Taylor expansion of token-level entropy change under a sampled policy update. This is a standard analytic approximation whose linear term is defined directly from the entropy function and the probability shift; it is not obtained by fitting to the target entropy trajectory or by redefining the quantity in terms of itself. The subsequent empirical correlation checks and the PAPO adaptive reweighting (which uses the observed entropy trajectory only as an online phase signal) are downstream applications, not inputs that force the polarity definition. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The derivation chain is therefore mathematically self-contained and independent of the quantities it is later used to predict.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on a first-order approximation whose validity is assumed without independent verification in the provided abstract; entropy polarity is introduced as a new derived quantity; PAPO introduces adaptive reweighting whose parameters are not enumerated.

free parameters (1)

advantage reweighting coefficients
Used to preserve both polarity branches; specific values or fitting procedure not stated in abstract.

axioms (1)

domain assumption First-order approximation sufficiently captures entropy change under sampled policy updates in RLVR
Invoked to derive the polarity quantity from the entropy mechanics analysis.

invented entities (1)

entropy polarity no independent evidence
purpose: Signed token-level predictor of entropy expansion or contraction
New quantity introduced via the first-order approximation; no independent falsifiable handle provided in abstract.

pith-pipeline@v0.9.0 · 5601 in / 1427 out tokens · 43999 ms · 2026-05-15T06:00:17.786406+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 (First-order Entropy Change via Sampled Updates). ... ΔH_t = -η A t1(s_t, y_t) + η A t2(s_t) + O(η²) where t1 = p_t (H_t + log p_t), t2 = ∑ p_v² (H_t + log p_v)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

entropy polarity P(s_t, y_t, A) := A T(s_t, y_t) ... positive/negative polarity branches

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 14 internal anchors

[1]

, author=

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. , author=. Nature , volume=

work page
[5]

2026 , url=

Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and YuYue and Weinan Dai and Tiantian Fan and Gaohong Liu and Juncai Liu and LingJun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Ru Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and ...

work page 2026
[6]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks , author=. arXiv preprint arXiv:2504.05118 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

An, Chenxin and Xie, Zhihui and Li, Xiaonan and Li, Lei and Zhang, Jun and Gong, Shansan and Zhong, Ming and Xu, Jingjing and Qiu, Xipeng and Wang, Mingxuan and Kong, Lingpeng , year =. POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

work page
[15]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for

Shenzhi Wang and Le Yu and Chang Gao and Chujie Zheng and Shixuan Liu and Rui Lu and Kai Dang and Xiong-Hui Chen and Jianxin Yang and Zhenru Zhang and Yuqiong Liu and An Yang and Andrew Zhao and Yang Yue and Shiji Song and Bowen Yu and Gao Huang and Junyang Lin , booktitle=. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement...

work page 2026
[16]

CoRR , volume =

Kai Yang and Xin Xu and Yangkun Chen and Weijie Liu and Jiafei Lyu and Zichuan Lin and Deheng Ye and Saiyong Yang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.15248 , eprinttype =. 2511.15248 , timestamp =

work page doi:10.48550/arxiv.2511.15248 2025
[18]

2026 , url=

Chen Huang and Wei Lu and Wenxuan Zhang , booktitle=. 2026 , url=

work page 2026
[19]

Beyond Magnitude: Leveraging Direction of

Kexin Huang and Haoming Meng and Junkang Wu and Jinda Lu and Chiyu Ma and Ziqian Chen and Xue Wang and Bolin Ding and Jiancan Wu and Xiang Wang and Xiangnan He and Guoyin Wang and Jingren Zhou , booktitle=. Beyond Magnitude: Leveraging Direction of. 2026 , url=

work page 2026
[20]

Sparse but Critical: A Token-Level Analysis of Distributional Shifts in

Haoming Meng and Kexin Huang and Shaohang Wei and Chiyu Ma and Shuo Yang and Xue Wang and Guoyin Wang and Bolin Ding and Jingren Zhou , booktitle=. Sparse but Critical: A Token-Level Analysis of Distributional Shifts in. 2026 , url=

work page 2026
[21]

2026 , url=

Zhiheng Xi and Xin Guo and Yang Nan and Enyu Zhou and Junrui Shen and Wenxiang Chen and Jiaqi Liu and Jixuan Huang and Xun Deng and Zhihao Zhang and Honglin Guo and Zhikai Lei and Miao Zheng and Guoteng Wang and Peng Sun and Rui Zheng and Hang Yan and Tao Gui and Qi Zhang and Xuanjing Huang , booktitle=. 2026 , url=

work page 2026
[24]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[25]

The Twelfth International Conference on Learning Representations,

Hunter Lightman and Vineet Kosaraju and Yuri Burda and Harrison Edwards and Bowen Baker and Teddy Lee and Jan Leike and John Schulman and Ilya Sutskever and Karl Cobbe , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

work page 2024
[26]

Hugging Face repository , volume=

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=

work page
[27]

2024 , note=

American invitational mathematics examination (aime) 2024 , author=. 2024 , note=

work page 2024
[28]

2025 , note=

American invitational mathematics examination (aime) 2025 , author=. 2025 , note=

work page 2025
[29]

Advances in neural information processing systems , volume=

Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=

work page
[31]

Forty-second International Conference on Machine Learning , year=

The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models , author=. Forty-second International Conference on Machine Learning , year=

work page
[33]

Proceedings of the 41st International Conference on Machine Learning , pages=

CRUXEval: a benchmark for code reasoning, understanding and execution , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

work page
[34]

The Thirteenth International Conference on Learning Representations , year=

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[35]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

work page
[41]

CoRR , volume =

Kexin Huang and Haoming Meng and Junkang Wu and Jinda Lu and Chiyu Ma and Ziqian Chen and Xue Wang and Bolin Ding and Jiancan Wu and Xiang Wang and Xiangnan He and Guoyin Wang and Jingren Zhou , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2603.22117 , eprinttype =. 2603.22117 , timestamp =

work page doi:10.48550/arxiv.2603.22117 2026
[44]

Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices , booktitle =

Evan Zheran Liu and Aditi Raghunathan and Percy Liang and Chelsea Finn , editor =. Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices , booktitle =. 2021 , url =

work page 2021
[49]

2026 , eprint=

AgentV-RL: Scaling Reward Modeling with Agentic Verifier , author=. 2026 , eprint=

work page 2026
[52]

Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris

work page 2025
[53]

Claude code, 2025

Anthropic. Claude code, 2025. URL [https://docs.anthropic.com/en/docs/claude-code](https://docs.anthropic.com/en/docs/claude-code)

work page 2025
[54]

TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

Sikai Bai, Haoxi Li, Jie Zhang, Yongjiang Liu, and Song Guo. Ttvs: Boosting self-exploring reinforcement learning via test-time variational synthesis. arXiv preprint arXiv:2604.08468, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

Acebench: Who wins the match point in tool usage? arXiv preprint arXiv:2501.12851, 2025 a

Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, et al. Acebench: Who wins the match point in tool usage? arXiv preprint arXiv:2501.12851, 2025 a

work page arXiv 2025
[56]

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, and Wenji Mao. Flexible entropy control in RLVR with gradient-preserving perspective. CoRR, abs/2602.09782, 2026. doi:10.48550/ARXIV.2602.09782. URL https://doi.org/10.48550/arXiv.2602.09782

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.09782 2026
[57]

Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward

Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward. CoRR, abs/2512.16912, 2025 b . doi:10.48550/ARXIV.2512.16912. URL https://doi.org/10.48550/arXiv.2512.16912

work page doi:10.48550/arxiv.2512.16912 2025
[58]

Beyond high-entropy exploration: Correctness-aware low-entropy segment-based advantage shaping for reasoning llms

Xinzhu Chen, Xuesheng Li, Zhongxiang Sun, and Weijie Yu. Beyond high-entropy exploration: Correctness-aware low-entropy segment-based advantage shaping for reasoning llms. CoRR, abs/2512.00908, 2025 c . doi:10.48550/ARXIV.2512.00908. URL https://doi.org/10.48550/arXiv.2512.00908

work page doi:10.48550/arxiv.2512.00908 2025
[59]

Reasoning with exploration: An entropy perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium...

work page doi:10.1609/aaai.v40i36.40290 2026
[60]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. CoRR, abs/2505.22617, 2025. doi:10.48550/ARXIV.2505.22617. URL https://doi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.22617 2025
[61]

Group-in-Group Policy Optimization for LLM Agent Training

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. CoRR, abs/2505.10978, 2025. doi:10.48550/ARXIV.2505.10978. URL https://doi.org/10.48550/arXiv.2505.10978

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.10978 2025
[62]

Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xionghui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. CoRR, abs/2511.20347, 2025. doi:10.48550/ARXIV.2511.20347. URL https://doi.org/10.48550/arXiv.2511.20347

work page internal anchor Pith review doi:10.48550/arxiv.2511.20347 2025
[63]

Cruxeval: a benchmark for code reasoning, understanding and execution

Alex Gu, Baptiste Rozi \`e re, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: a benchmark for code reasoning, understanding and execution. In Proceedings of the 41st International Conference on Machine Learning, pages 16568--16621, 2024

work page 2024
[64]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

D Guo, D Yang, H Zhang, J Song, P Wang, Q Zhu, R Xu, R Zhang, S Ma, X Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025

work page 2025
[65]

Justrl: Scaling a 1.5 b llm with a simple rl recipe

Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. Justrl: Scaling a 1.5 b llm with a simple rl recipe. arXiv preprint arXiv:2512.16649, 2025

work page arXiv 2025
[66]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

work page 2024
[67]

Beyond magnitude: Leveraging direction of RLVR updates for LLM reasoning

Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou. Beyond magnitude: Leveraging direction of RLVR updates for LLM reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=r6Pw3RiMYL

work page 2026
[68]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13 0 (9): 0 9, 2024

work page 2024
[70]

Let's verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

work page 2024
[71]

Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices

Evan Zheran Liu, Aditi Raghunathan, Percy Liang, and Chelsea Finn. Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event , Proceedings of Machine Learning Research, p...

work page 2021
[72]

Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLM s

Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, and Jingren Zhou. Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLM s. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=8vWIXno8LW

work page 2026
[73]

The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

work page 2025
[74]

a henb \

Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco F. Cusumano - Towner, Raja Giryes, and Philipp Kr \" a henb \" u hl. Entropy-preserving reinforcement learning. CoRR, abs/2603.11682, 2026. doi:10.48550/ARXIV.2603.11682. URL https://doi.org/10.48550/arXiv.2603.11682

work page doi:10.48550/arxiv.2603.11682 2026
[75]

arXiv preprint arXiv:2505.22660

Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Maximizing confidence alone improves reasoning. CoRR, abs/2505.22660, 2025. doi:10.48550/ARXIV.2505.22660. URL https://doi.org/10.48550/arXiv.2505.22660

work page doi:10.48550/arxiv.2505.22660 2025
[76]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[77]

On entropy control in LLM-RL algorithms

Han Shen. On entropy control in LLM-RL algorithms. CoRR, abs/2509.03493, 2025. doi:10.48550/ARXIV.2509.03493. URL https://doi.org/10.48550/arXiv.2509.03493

work page doi:10.48550/arxiv.2509.03493 2025
[78]

CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. CE-GPPO: coordinating entropy via gradient-preserving clipping policy optimization in reinforcement learning. CoRR, abs/2509.20712, 2025. doi:10.48550/ARXIV.2509.20712. URL https://doi.org/10.48550/arXiv.2509.20712

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.20712 2025
[79]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. IEEE Trans. Neural Networks , 9 0 (5): 0 1054--1054, 1998. doi:10.1109/TNN.1998.712192. URL https://doi.org/10.1109/TNN.1998.712192

work page doi:10.1109/tnn.1998.712192 1998
[80]

Rethinking sample polarity in reinforcement learning with verifiable rewards

Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Rethinking sample polarity in reinforcement learning with verifiable rewards. CoRR, abs/2512.21625, 2025. doi:10.48550/ARXIV.2512.21625. URL https://doi.org/10.48550/arXiv.2512.21625

work page doi:10.48550/arxiv.2512.21625 2025
[81]

Skip-Connected Policy Optimization for Implicit Advantage

Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao, and Zhijiang Guo. Skip-connected policy optimization for implicit advantage. arXiv preprint arXiv:2604.08690, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[82]

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In The Thirty-ninth Annual Conf...

work page 2026
[83]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024

work page 2024
[84]

Agentgym-rl: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning

Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, and Yu - Gang Jiang. Agentgym-rl: Training LLM agents for long-horizon decision making ...

work page doi:10.48550/arxiv.2509.08755 2025
[85]

Can RL improve generalization of LLM agents? an empirical study

Zhiheng Xi, Xin Guo, Jiaqi Liu, Jiazheng Zhang, Yutao Fan, Zhihao Zhang, Shichun Liu, Mingxu Chai, Xiaowei Shi, Yitao Zhai, Xunliang Cai, Tao Gui, Qi Zhang, and Xuanjing Huang. Can RL improve generalization of LLM agents? an empirical study. CoRR, abs/2603.12011, 2026 a . doi:10.48550/ARXIV.2603.12011. URL https://doi.org/10.48550/arXiv.2603.12011

work page doi:10.48550/arxiv.2603.12011 2026
[86]

BAPO : Stabilizing off-policy reinforcement learning for LLM s via balanced policy optimization with adaptive clipping

Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Xun Deng, Zhihao Zhang, Honglin Guo, Zhikai Lei, Miao Zheng, Guoteng Wang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, and Xuanjing Huang. BAPO : Stabilizing off-policy reinforcement learning for LLM s via balanced policy optimization with adaptive clippin...

work page 2026
[87]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[88]

DAPO : An open-source LLM reinforcement learning system at scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

work page 2026
[90]

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, et al. Agentv-rl: Scaling reward modeling with agentic verifier. arXiv preprint arXiv:2604.16004, 2026 b

work page internal anchor Pith review Pith/arXiv arXiv 2026
[91]

A survey of reinforcement learning for large reasoning models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua...

work page doi:10.48550/arxiv.2509.08827 2025
[92]

Nemotron-research-tool-n1: Exploring tool-using language models with reinforced reasoning

Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. Nemotron-research-tool-n1: Exploring tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024, 2025 b

work page arXiv 2025
[93]

American invitational mathematics examination (aime) 2024, 2024

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024. Contest problem collection

work page 2024
[94]

Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective

Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, et al. Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective. arXiv preprint arXiv:2506.23508, 2025 c

work page arXiv 2025
[95]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[96]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023