Recognition: 2 theorem links
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Pith reviewed 2026-05-15 06:00 UTC · model grok-4.3
The pith
Entropy polarity, a signed token-level quantity, predicts whether policy updates expand or contract entropy in reinforcement fine-tuning of language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In RLVR for LLMs, entropy change admits a first-order approximation that defines entropy polarity, a signed token-level quantity predicting the direction and magnitude of entropy modification by a sampled update. Reinforcing frequent high-probability tokens produces contraction tendencies, whereas expansive tendencies arise mainly from lower-probability samples or stronger distributional correction. This asymmetry implies that positive and negative polarity branches play complementary roles, which Polarity-Aware Policy Optimization exploits by preserving both branches and reallocating pressure adaptively according to the empirical entropy trajectory.
What carries the argument
Entropy polarity: a signed token-level quantity obtained from the first-order approximation of entropy change, which indicates whether a given policy update expands or contracts entropy.
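The first-order expression quoted later in this review (Theorem 1) makes the quantity concrete. A minimal sketch, assuming a per-position next-token distribution `probs`, a sampled token index, and a scalar advantage; function and variable names here are illustrative, not the paper's notation:

```python
import math

def predicted_entropy_change(probs, sampled, advantage, lr=1.0):
    """First-order prediction of how reinforcing one sampled token moves entropy.

    Sketch of the expression quoted in this review:
        dH ~= -lr*A*T1 + lr*A*T2,
        T1 = p_t*(H + log p_t),  T2 = sum_v p_v^2*(H + log p_v).
    A positive result predicts entropy expansion, a negative one contraction.
    """
    H = -sum(p * math.log(p) for p in probs if p > 0.0)
    p_t = probs[sampled]  # assumed nonzero: the token was actually sampled
    t1 = p_t * (H + math.log(p_t))
    t2 = sum(p * p * (H + math.log(p)) for p in probs if p > 0.0)
    return lr * advantage * (t2 - t1)
```

Under this sketch the claimed asymmetry falls out directly: reinforcing the dominant token of a peaked distribution gives a negative value (contraction), while reinforcing a low-probability token gives a positive one (expansion).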
If this is right
- Positive-polarity updates preserve exploration by expanding entropy, while negative-polarity updates strengthen exploitation by contracting it.
- Advantage reweighting that preserves both polarity branches allows simultaneous improvement in reward and training efficiency.
- Adaptive reallocation of optimization pressure based on the running entropy trajectory yields consistent gains on mathematical reasoning and agentic tasks.
- The polarity framework supplies a token-level signal that can be monitored online to maintain a desired entropy level without external regularizers.
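The last two bullets can be made concrete with a toy controller. This is an illustrative sketch of entropy-trajectory-driven advantage reweighting, not the paper's actual PAPO rule; `entropy_target`, `strength`, and the linear phase signal are all assumptions introduced for the example:

```python
def polarity_aware_reweight(advantages, polarities, entropy_now, entropy_target,
                            strength=0.5):
    """Toy reallocation of optimization pressure between polarity branches.

    When running entropy sits below the target, up-weight entropy-expanding
    (positive-polarity) tokens and down-weight contracting ones; reverse the
    bias when entropy sits above the target.
    """
    # phase signal in [-1, 1]: negative means entropy has dropped below target
    phase = max(-1.0, min(1.0, (entropy_now - entropy_target)
                          / max(entropy_target, 1e-8)))
    out = []
    for a, p in zip(advantages, polarities):
        if p > 0:   # entropy-expanding branch
            w = 1.0 + strength * (-phase)
        else:       # entropy-contracting branch
            w = 1.0 + strength * phase
        out.append(a * max(w, 0.0))  # keep weights non-negative
    return out
```

With entropy at half the target, the expanding token's advantage is scaled up (1.25x here) and the contracting token's down (0.75x), which is the qualitative behavior the bullets describe.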
Where Pith is reading between the lines
- Entropy polarity could serve as an online diagnostic to detect and counteract premature entropy collapse in other fine-tuning regimes beyond RLVR.
- The observed contraction bias for high-probability tokens may generalize to explain rapid overfitting patterns in non-LLM reinforcement learning.
- Combining polarity signals with existing entropy bonuses or KL penalties could produce more stable multi-objective control in policy optimization.
- Token-level polarity tracking might enable finer-grained intervention, such as selectively amplifying expansive updates only on reasoning-critical tokens.
Load-bearing premise
The first-order approximation of entropy change accurately captures the dominant mechanism by which sampled policy updates reshape token-level entropy in RLVR for LLMs.
What would settle it
Compute entropy polarity for each token in sampled updates during RLVR training and compare the predicted direction against the actual measured change in token entropy; systematic mismatch between predicted and observed signs would falsify the approximation.
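The proposed check is easy to run in miniature for a single softmax policy. The sketch below, assuming a plain REINFORCE-style logit update (grad of log p_t is onehot(t) - p), compares the first-order prediction against the exactly recomputed entropy; a systematic sign mismatch at small step sizes would falsify the approximation:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def check_polarity(logits, t, advantage=1.0, eta=1e-4):
    """Return (predicted, actual) entropy change for reinforcing token t."""
    p = softmax(logits)
    H = entropy(p)
    t1 = p[t] * (H + math.log(p[t]))
    t2 = sum(q * q * (H + math.log(q)) for q in p)
    predicted = eta * advantage * (t2 - t1)
    # one policy-gradient step on the logits: grad log p_t = onehot(t) - p
    new_logits = [z + eta * advantage * ((1.0 if v == t else 0.0) - p[v])
                  for v, z in enumerate(logits)]
    actual = entropy(softmax(new_logits)) - H
    return predicted, actual
```

At eta = 1e-4 the two values agree to several decimal places, with the residual shrinking quadratically in the step size, as a valid first-order approximation requires.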
Original abstract
Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a theoretical framework for entropy mechanics in RLVR for LLMs. It derives a first-order approximation of token-level entropy change under sampled policy updates, introducing entropy polarity as a signed token-level quantity that predicts entropy expansion or contraction. The analysis identifies a structural asymmetry: high-probability tokens induce contraction while expansion requires lower-probability samples or stronger correction. Empirically, polarity is shown to correlate with observed entropy trajectories; the proposed PAPO method uses polarity-aware advantage reweighting with the empirical entropy trajectory as an online signal to balance the two branches, yielding improved performance on mathematical reasoning and agentic benchmarks.
Significance. If the first-order approximation holds with controllable error, the work supplies a token-level mechanistic account of how policy updates reshape entropy, moving beyond global regularization. The asymmetry result and PAPO controller could enable more targeted exploration-exploitation trade-offs in LLM fine-tuning. The empirical correlation and benchmark gains, if statistically robust, would constitute a practical contribution to entropy-aware RLVR methods.
major comments (3)
- [§3.2, Eq. (7)] The first-order Taylor expansion for token-level entropy change ΔH_i is stated without the Lagrange remainder or any explicit bound on higher-order terms as a function of the local probability shift |Δπ_i|. No analysis is given for the regime of typical RLVR KL divergences or max-probability shifts where the linear term dominates, which is required for polarity to reliably predict sign and magnitude of entropy change.
- [Experiments section, Tables 2–4] Reported performance gains for PAPO versus baselines are presented without error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the observed reward and efficiency improvements are stable or could be explained by variance in the RLVR training runs.
- [§4.3] The adaptive reweighting in PAPO is described as using the empirical entropy trajectory as an online phase signal, yet no ablation is reported that isolates the contribution of polarity-based branching versus simple entropy-target tracking. This leaves open whether the polarity construct itself is load-bearing for the claimed gains.
minor comments (3)
- [§3.1] Notation for token-level entropy H_i and polarity P_i is introduced without an explicit comparison table to prior global entropy measures (e.g., those in PPO or GRPO), which would help readers situate the new quantities.
- [Figure 3] The Figure 3 caption does not state the number of tokens or trajectories aggregated; axis label font sizes are inconsistent with the main text.
- [Abstract and §5] The abstract claims 'substantial reward improvements' but the main text does not quantify the absolute reward deltas or normalize them against the baseline variance.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas for improvement in both the theoretical analysis and empirical validation. We provide point-by-point responses below and commit to making the necessary revisions.
Point-by-point responses
Referee: §3.2, Eq. (7): The first-order Taylor expansion for token-level entropy change ΔH_i is stated without the Lagrange remainder or any explicit bound on higher-order terms as a function of the local probability shift |Δπ_i|. No analysis is given for the regime of typical RLVR KL divergences or max-probability shifts where the linear term dominates, which is required for polarity to reliably predict sign and magnitude of entropy change.
Authors: We agree that an explicit error analysis would strengthen the theoretical foundation. In the revised manuscript, we will derive the Lagrange remainder for the Taylor expansion of the entropy function and provide a bound on the higher-order terms in terms of |Δπ_i|. Furthermore, we will include an analysis of typical RLVR KL divergence values (commonly in the range of 0.01 to 0.05) and demonstrate, both theoretically and via additional experiments, that the first-order term dominates under these conditions, thereby validating the use of entropy polarity for sign prediction. revision: yes
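A sketch of one standard route to such a bound (not necessarily the authors' intended derivation): since $H(p) = -\sum_v p_v \log p_v$ is coordinate-separable, a per-coordinate second-order Taylor expansion with Lagrange remainder gives

$$
H(p+\Delta) \;=\; H(p) \;-\; \sum_v (\log p_v + 1)\,\Delta_v \;-\; \frac{1}{2}\sum_v \frac{\Delta_v^2}{\xi_v},
\qquad \xi_v \ \text{between}\ p_v \ \text{and}\ p_v+\Delta_v,
$$

using $f(x) = -x\log x$, $f''(x) = -1/x$. The remainder therefore satisfies $|R| \le \tfrac{1}{2}\sum_v \Delta_v^2 / \min(p_v,\, p_v+\Delta_v)$, which shrinks quadratically in the probability shift and makes precise when the linear (polarity) term dominates.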
Referee: Experiments section, Tables 2–4: Reported performance gains for PAPO versus baselines are presented without error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the observed reward and efficiency improvements are stable or could be explained by variance in the RLVR training runs.
Authors: This is a valid concern regarding the robustness of our empirical results. We will revise the experiments section to include results from multiple random seeds (at least three), report means with standard deviations as error bars in Tables 2-4, and perform statistical significance tests (e.g., paired t-tests) to confirm that the improvements are statistically significant and not due to training variance. revision: yes
Referee: §4.3: The adaptive reweighting in PAPO is described as using the empirical entropy trajectory as an online phase signal, yet no ablation is reported that isolates the contribution of polarity-based branching versus simple entropy-target tracking. This leaves open whether the polarity construct itself is load-bearing for the claimed gains.
Authors: We appreciate the suggestion to better isolate the effect of polarity. In the revised paper, we will add a new ablation experiment comparing the full PAPO method against a baseline that performs entropy-target tracking without the polarity-based branching mechanism. This will clarify the specific role of the polarity construct in achieving the reported performance gains. revision: yes
Circularity Check
No significant circularity in entropy polarity derivation
Full rationale
The paper's central derivation obtains entropy polarity from a first-order Taylor expansion of token-level entropy change under a sampled policy update. This is a standard analytic approximation whose linear term is defined directly from the entropy function and the probability shift; it is not obtained by fitting to the target entropy trajectory or by redefining the quantity in terms of itself. The subsequent empirical correlation checks and the PAPO adaptive reweighting (which uses the observed entropy trajectory only as an online phase signal) are downstream applications, not inputs that force the polarity definition. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The derivation chain is therefore mathematically self-contained and independent of the quantities it is later used to predict.
Axiom & Free-Parameter Ledger
free parameters (1)
- advantage reweighting coefficients
axioms (1)
- domain assumption: the first-order approximation sufficiently captures entropy change under sampled policy updates in RLVR
invented entities (1)
- entropy polarity (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Theorem 1 (First-order Entropy Change via Sampled Updates). ... ΔH_t = -η A T1(s_t, y_t) + η A T2(s_t) + O(η²), where T1 = p_t (H_t + log p_t) and T2 = ∑_v p_v² (H_t + log p_v)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: entropy polarity P(s_t, y_t, A) := A T(s_t, y_t) ... positive/negative polarity branches
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [5] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, et al. 2026.
- [6] VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118.
- [9] Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. POLARIS: A post-training recipe for scaling reinforcement learning on advanced reasoning models.
- [15] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. 2026.
- [16] Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, and Saiyong Yang. CoRR, abs/2511.15248, 2025. doi:10.48550/ARXIV.2511.15248.
- [18]
- [19] Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou. Beyond magnitude: Leveraging direction of RLVR updates for LLM reasoning. 2026.
- [20] Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, and Jingren Zhou. Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLMs. 2026.
- [21] Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Xun Deng, Zhihao Zhang, Honglin Guo, Zhikai Lei, Miao Zheng, Guoteng Wang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, and Xuanjing Huang. 2026.
- [24] OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [25] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024.
- [26] NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository.
- [27] American Invitational Mathematics Examination (AIME) 2024. 2024.
- [28] American Invitational Mathematics Examination (AIME) 2025. 2025.
- [29] Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems.
- [31] The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning.
- [33] CRUXEval: A benchmark for code reasoning, understanding and execution. In Proceedings of the 41st International Conference on Machine Learning.
- [34] LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations.
- [35] MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems.
- [41] Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou. CoRR, 2026. doi:10.48550/ARXIV.2603.22117.
- [44] Evan Zheran Liu, Aditi Raghunathan, Percy Liang, and Chelsea Finn. Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices. 2021.
- [49] AgentV-RL: Scaling reward modeling with agentic verifier. 2026.
- [52] Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris
- [53] Anthropic. Claude Code, 2025. URL https://docs.anthropic.com/en/docs/claude-code
- [54] Sikai Bai, Haoxi Li, Jie Zhang, Yongjiang Liu, and Song Guo. TTVS: Boosting self-exploring reinforcement learning via test-time variational synthesis. arXiv preprint arXiv:2604.08468, 2026.
- [55] Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, et al. AceBench: Who wins the match point in tool usage? arXiv preprint arXiv:2501.12851, 2025a.
- [56] Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, and Wenji Mao. Flexible entropy control in RLVR with a gradient-preserving perspective. CoRR, abs/2602.09782, 2026. doi:10.48550/ARXIV.2602.09782. URL https://doi.org/10.48550/arXiv.2602.09782
- [57] Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward. CoRR, abs/2512.16912, 2025b. doi:10.48550/ARXIV.2512.16912. URL https://doi.org/10.48550/arXiv.2512.16912
- [58] Xinzhu Chen, Xuesheng Li, Zhongxiang Sun, and Weijie Yu. Beyond high-entropy exploration: Correctness-aware low-entropy segment-based advantage shaping for reasoning LLMs. CoRR, abs/2512.00908, 2025c. doi:10.48550/ARXIV.2512.00908. URL https://doi.org/10.48550/arXiv.2512.00908
- [59] Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium...
- [60] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. CoRR, abs/2505.22617, 2025. doi:10.48550/ARXIV.2505.22617.
- [61] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. CoRR, abs/2505.10978, 2025. doi:10.48550/ARXIV.2505.10978. URL https://doi.org/10.48550/arXiv.2505.10978
- [62] Chang Gao, Chujie Zheng, Xionghui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. CoRR, abs/2511.20347, 2025. doi:10.48550/ARXIV.2511.20347. URL https://doi.org/10.48550/arXiv.2511.20347
- [63] Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. In Proceedings of the 41st International Conference on Machine Learning, pages 16568–16621, 2024.
- [64] D Guo, D Yang, H Zhang, J Song, P Wang, Q Zhu, R Xu, R Zhang, S Ma, X Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
- [65] Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. JustRL: Scaling a 1.5B LLM with a simple RL recipe. arXiv preprint arXiv:2512.16649, 2025.
- [66] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.
- [67] Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou. Beyond magnitude: Leveraging direction of RLVR updates for LLM reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=r6Pw3RiMYL
- [68] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [69] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024.
- [70] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi
- [71] Evan Zheran Liu, Aditi Raghunathan, Percy Liang, and Chelsea Finn. Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, Proceedings of Machine Learning Research, 2021.
- [72] Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, and Jingren Zhou. Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=8vWIXno8LW
- [73] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025.
- [74] Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco F. Cusumano-Towner, Raja Giryes, and Philipp Krähenbühl. Entropy-preserving reinforcement learning. CoRR, abs/2603.11682, 2026. doi:10.48550/ARXIV.2603.11682. URL https://doi.org/10.48550/arXiv.2603.11682
- [75] Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Maximizing confidence alone improves reasoning. CoRR, abs/2505.22660, 2025. doi:10.48550/ARXIV.2505.22660. URL https://doi.org/10.48550/arXiv.2505.22660
- [76] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [77] Han Shen. On entropy control in LLM-RL algorithms. CoRR, abs/2509.03493, 2025. doi:10.48550/ARXIV.2509.03493. URL https://doi.org/10.48550/arXiv.2509.03493
- [78] Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. CE-GPPO: Coordinating entropy via gradient-preserving clipping policy optimization in reinforcement learning. CoRR, abs/2509.20712, 2025. doi:10.48550/ARXIV.2509.20712. URL https://doi.org/10.48550/arXiv.2509.20712
- [79] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. IEEE Trans. Neural Networks, 9(5):1054, 1998. doi:10.1109/TNN.1998.712192. URL https://doi.org/10.1109/TNN.1998.712192
- [80] Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Rethinking sample polarity in reinforcement learning with verifiable rewards. CoRR, abs/2512.21625, 2025. doi:10.48550/ARXIV.2512.21625. URL https://doi.org/10.48550/arXiv.2512.21625
- [81] Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao, and Zhijiang Guo. Skip-connected policy optimization for implicit advantage. arXiv preprint arXiv:2604.08690, 2026.
- [82] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In The Thirty-ninth Annual Conf..., 2026.
- [83] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024.
- [84] Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, and Yu-Gang Jiang. AgentGym-RL: Training LLM agents for long-horizon decision making ...
- [85] Zhiheng Xi, Xin Guo, Jiaqi Liu, Jiazheng Zhang, Yutao Fan, Zhihao Zhang, Shichun Liu, Mingxu Chai, Xiaowei Shi, Yitao Zhai, Xunliang Cai, Tao Gui, Qi Zhang, and Xuanjing Huang. Can RL improve generalization of LLM agents? An empirical study. CoRR, abs/2603.12011, 2026a. doi:10.48550/ARXIV.2603.12011. URL https://doi.org/10.48550/arXiv.2603.12011
- [86] Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Xun Deng, Zhihao Zhang, Honglin Guo, Zhikai Lei, Miao Zheng, Guoteng Wang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, and Xuanjing Huang. BAPO: Stabilizing off-policy reinforcement learning for LLMs via balanced policy optimization with adaptive clipping..., 2026.
- [87] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ..., 2024.
- [88] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ... DAPO: An open-source LLM reinforcement learning system at scale. 2026.
- [90] Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, et al. AgentV-RL: Scaling reward modeling with agentic verifier. arXiv preprint arXiv:2604.16004, 2026b.
- [91] Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua... A survey of reinforcement learning for large reasoning models.
- [92] Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. Nemotron-Research-Tool-N1: Exploring tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024, 2025b.
- [93] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024. Contest problem collection.
- [94] Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, et al. Why reinforcement fine-tuning enables MLLMs preserve prior knowledge better: A data perspective. arXiv preprint arXiv:2506.23508, 2025c.
- [95] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
- [96] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
discussion (0)