pith. machine review for the scientific record. sign in

arxiv: 2605.11775 · v2 · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:00 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords entropy polaritypolicy entropyreinforcement fine-tuninglarge language modelsexploration controlRLVRpolicy optimizationtoken-level analysis
0
0 comments X

The pith

Entropy polarity, a signed token-level quantity, predicts whether policy updates expand or contract entropy in reinforcement fine-tuning of language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework for entropy mechanics in reinforcement learning with verifiable rewards for large language models. It derives a first-order approximation of entropy change that produces entropy polarity, a signed measure at each token showing whether an update will increase or decrease overall policy entropy. The work identifies a structural asymmetry in which updates on high-probability tokens drive entropy contraction while expansion typically needs lower-probability tokens. From this foundation the authors introduce Polarity-Aware Policy Optimization, which balances both polarity directions through advantage reweighting and uses observed entropy trajectories to adjust optimization pressure dynamically.

Core claim

In RLVR for LLMs, entropy change admits a first-order approximation that defines entropy polarity, a signed token-level quantity predicting the direction and magnitude of entropy modification by a sampled update. Reinforcing frequent high-probability tokens produces contraction tendencies, whereas expansive tendencies arise mainly from lower-probability samples or stronger distributional correction. This asymmetry implies that positive and negative polarity branches play complementary roles, which Polarity-Aware Policy Optimization exploits by preserving both branches and reallocating pressure adaptively according to the empirical entropy trajectory.

What carries the argument

Entropy polarity: a signed token-level quantity obtained from the first-order approximation of entropy change, which indicates whether a given policy update expands or contracts entropy.

If this is right

  • Positive-polarity updates preserve exploration by expanding entropy while negative-polarity updates strengthen exploitation by contracting it.
  • Advantage reweighting that preserves both polarity branches allows simultaneous improvement in reward and training efficiency.
  • Adaptive reallocation of optimization pressure based on the running entropy trajectory yields consistent gains on mathematical reasoning and agentic tasks.
  • The polarity framework supplies a token-level signal that can be monitored online to maintain a desired entropy level without external regularizers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Entropy polarity could serve as an online diagnostic to detect and counteract premature entropy collapse in other fine-tuning regimes beyond RLVR.
  • The observed contraction bias for high-probability tokens may generalize to explain rapid overfitting patterns in non-LLM reinforcement learning.
  • Combining polarity signals with existing entropy bonuses or KL penalties could produce more stable multi-objective control in policy optimization.
  • Token-level polarity tracking might enable finer-grained intervention, such as selectively amplifying expansive updates only on reasoning-critical tokens.

Load-bearing premise

The first-order approximation of entropy change accurately captures the dominant mechanism by which sampled policy updates reshape token-level entropy in RLVR for LLMs.

What would settle it

Compute entropy polarity for each token in sampled updates during RLVR training and compare the predicted direction against the actual measured change in token entropy; systematic mismatch between predicted and observed signs would falsify the approximation.

read the original abstract

Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper develops a theoretical framework for entropy mechanics in RLVR for LLMs. It derives a first-order approximation of token-level entropy change under sampled policy updates, introducing entropy polarity as a signed token-level quantity that predicts entropy expansion or contraction. The analysis identifies a structural asymmetry: high-probability tokens induce contraction while expansion requires lower-probability samples or stronger correction. Empirically, polarity is shown to correlate with observed entropy trajectories; the proposed PAPO method uses polarity-aware advantage reweighting with the empirical entropy trajectory as an online signal to balance the two branches, yielding improved performance on mathematical reasoning and agentic benchmarks.

Significance. If the first-order approximation holds with controllable error, the work supplies a token-level mechanistic account of how policy updates reshape entropy, moving beyond global regularization. The asymmetry result and PAPO controller could enable more targeted exploration-exploitation trade-offs in LLM fine-tuning. The empirical correlation and benchmark gains, if statistically robust, would constitute a practical contribution to entropy-aware RLVR methods.

major comments (3)
  1. [§3.2, Eq. (7)] §3.2, Eq. (7): The first-order Taylor expansion for token-level entropy change ΔH_i is stated without the Lagrange remainder or any explicit bound on higher-order terms as a function of the local probability shift |Δπ_i|. No analysis is given for the regime of typical RLVR KL divergences or max-probability shifts where the linear term dominates, which is required for polarity to reliably predict sign and magnitude of entropy change.
  2. [Experiments section, Tables 2–4] Experiments section, Tables 2–4: Reported performance gains for PAPO versus baselines are presented without error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the observed reward and efficiency improvements are stable or could be explained by variance in the RLVR training runs.
  3. [§4.3] §4.3: The adaptive reweighting in PAPO is described as using the empirical entropy trajectory as an online phase signal, yet no ablation is reported that isolates the contribution of polarity-based branching versus simple entropy-target tracking. This leaves open whether the polarity construct itself is load-bearing for the claimed gains.
minor comments (3)
  1. [§3.1] Notation for token-level entropy H_i and polarity P_i is introduced without an explicit comparison table to prior global entropy measures (e.g., those in PPO or GRPO), which would help readers situate the new quantities.
  2. [Figure 3] Figure 3 caption does not state the number of tokens or trajectories aggregated; axis labels use inconsistent font sizes with the main text.
  3. [Abstract and §5] The abstract claims 'substantial reward improvements' but the main text does not quantify the absolute reward deltas or normalize them against the baseline variance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in both the theoretical analysis and empirical validation. We provide point-by-point responses below and commit to making the necessary revisions.

read point-by-point responses
  1. Referee: §3.2, Eq. (7): The first-order Taylor expansion for token-level entropy change ΔH_i is stated without the Lagrange remainder or any explicit bound on higher-order terms as a function of the local probability shift |Δπ_i|. No analysis is given for the regime of typical RLVR KL divergences or max-probability shifts where the linear term dominates, which is required for polarity to reliably predict sign and magnitude of entropy change.

    Authors: We agree that an explicit error analysis would strengthen the theoretical foundation. In the revised manuscript, we will derive the Lagrange remainder for the Taylor expansion of the entropy function and provide a bound on the higher-order terms in terms of |Δπ_i|. Furthermore, we will include an analysis of typical RLVR KL divergence values (commonly in the range of 0.01 to 0.05) and demonstrate, both theoretically and via additional experiments, that the first-order term dominates under these conditions, thereby validating the use of entropy polarity for sign prediction. revision: yes

  2. Referee: Experiments section, Tables 2–4: Reported performance gains for PAPO versus baselines are presented without error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the observed reward and efficiency improvements are stable or could be explained by variance in the RLVR training runs.

    Authors: This is a valid concern regarding the robustness of our empirical results. We will revise the experiments section to include results from multiple random seeds (at least three), report means with standard deviations as error bars in Tables 2-4, and perform statistical significance tests (e.g., paired t-tests) to confirm that the improvements are statistically significant and not due to training variance. revision: yes

  3. Referee: §4.3: The adaptive reweighting in PAPO is described as using the empirical entropy trajectory as an online phase signal, yet no ablation is reported that isolates the contribution of polarity-based branching versus simple entropy-target tracking. This leaves open whether the polarity construct itself is load-bearing for the claimed gains.

    Authors: We appreciate the suggestion to better isolate the effect of polarity. In the revised paper, we will add a new ablation experiment comparing the full PAPO method against a baseline that performs entropy-target tracking without the polarity-based branching mechanism. This will clarify the specific role of the polarity construct in achieving the reported performance gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity in entropy polarity derivation

full rationale

The paper's central derivation obtains entropy polarity from a first-order Taylor expansion of token-level entropy change under a sampled policy update. This is a standard analytic approximation whose linear term is defined directly from the entropy function and the probability shift; it is not obtained by fitting to the target entropy trajectory or by redefining the quantity in terms of itself. The subsequent empirical correlation checks and the PAPO adaptive reweighting (which uses the observed entropy trajectory only as an online phase signal) are downstream applications, not inputs that force the polarity definition. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The derivation chain is therefore mathematically self-contained and independent of the quantities it is later used to predict.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on a first-order approximation whose validity is assumed without independent verification in the provided abstract; entropy polarity is introduced as a new derived quantity; PAPO introduces adaptive reweighting whose parameters are not enumerated.

free parameters (1)
  • advantage reweighting coefficients
    Used to preserve both polarity branches; specific values or fitting procedure not stated in abstract.
axioms (1)
  • domain assumption First-order approximation sufficiently captures entropy change under sampled policy updates in RLVR
    Invoked to derive the polarity quantity from the entropy mechanics analysis.
invented entities (1)
  • entropy polarity no independent evidence
    purpose: Signed token-level predictor of entropy expansion or contraction
    New quantity introduced via the first-order approximation; no independent falsifiable handle provided in abstract.

pith-pipeline@v0.9.0 · 5601 in / 1427 out tokens · 43999 ms · 2026-05-15T06:00:17.786406+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 14 internal anchors

  1. [1]

    , author=

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. , author=. Nature , volume=

  2. [5]

    2026 , url=

    Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and YuYue and Weinan Dai and Tiantian Fan and Gaohong Liu and Juncai Liu and LingJun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Ru Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and ...

  3. [6]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks , author=. arXiv preprint arXiv:2504.05118 , year=

  4. [9]

    POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

    An, Chenxin and Xie, Zhihui and Li, Xiaonan and Li, Lei and Zhang, Jun and Gong, Shansan and Zhong, Ming and Xu, Jingjing and Qiu, Xipeng and Wang, Mingxuan and Kong, Lingpeng , year =. POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

  5. [15]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for

    Shenzhi Wang and Le Yu and Chang Gao and Chujie Zheng and Shixuan Liu and Rui Lu and Kai Dang and Xiong-Hui Chen and Jianxin Yang and Zhenru Zhang and Yuqiong Liu and An Yang and Andrew Zhao and Yang Yue and Shiji Song and Bowen Yu and Gao Huang and Junyang Lin , booktitle=. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement...

  6. [16]

    CoRR , volume =

    Kai Yang and Xin Xu and Yangkun Chen and Weijie Liu and Jiafei Lyu and Zichuan Lin and Deheng Ye and Saiyong Yang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2511.15248 , eprinttype =. 2511.15248 , timestamp =

  7. [18]

    2026 , url=

    Chen Huang and Wei Lu and Wenxuan Zhang , booktitle=. 2026 , url=

  8. [19]

    Beyond Magnitude: Leveraging Direction of

    Kexin Huang and Haoming Meng and Junkang Wu and Jinda Lu and Chiyu Ma and Ziqian Chen and Xue Wang and Bolin Ding and Jiancan Wu and Xiang Wang and Xiangnan He and Guoyin Wang and Jingren Zhou , booktitle=. Beyond Magnitude: Leveraging Direction of. 2026 , url=

  9. [20]

    Sparse but Critical: A Token-Level Analysis of Distributional Shifts in

    Haoming Meng and Kexin Huang and Shaohang Wei and Chiyu Ma and Shuo Yang and Xue Wang and Guoyin Wang and Bolin Ding and Jingren Zhou , booktitle=. Sparse but Critical: A Token-Level Analysis of Distributional Shifts in. 2026 , url=

  10. [21]

    2026 , url=

    Zhiheng Xi and Xin Guo and Yang Nan and Enyu Zhou and Junrui Shen and Wenxiang Chen and Jiaqi Liu and Jixuan Huang and Xun Deng and Zhihao Zhang and Honglin Guo and Zhikai Lei and Miao Zheng and Guoteng Wang and Peng Sun and Rui Zheng and Hang Yan and Tao Gui and Qi Zhang and Xuanjing Huang , booktitle=. 2026 , url=

  11. [24]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  12. [25]

    The Twelfth International Conference on Learning Representations,

    Hunter Lightman and Vineet Kosaraju and Yuri Burda and Harrison Edwards and Bowen Baker and Teddy Lee and Jan Leike and John Schulman and Ilya Sutskever and Karl Cobbe , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

  13. [26]

    Hugging Face repository , volume=

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=

  14. [27]

    2024 , note=

    American invitational mathematics examination (aime) 2024 , author=. 2024 , note=

  15. [28]

    2025 , note=

    American invitational mathematics examination (aime) 2025 , author=. 2025 , note=

  16. [29]

    Advances in neural information processing systems , volume=

    Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=

  17. [31]

    Forty-second International Conference on Machine Learning , year=

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models , author=. Forty-second International Conference on Machine Learning , year=

  18. [33]

    Proceedings of the 41st International Conference on Machine Learning , pages=

    CRUXEval: a benchmark for code reasoning, understanding and execution , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

  19. [34]

    The Thirteenth International Conference on Learning Representations , year=

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. The Thirteenth International Conference on Learning Representations , year=

  20. [35]

    Advances in Neural Information Processing Systems , volume=

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

  21. [41]

    CoRR , volume =

    Kexin Huang and Haoming Meng and Junkang Wu and Jinda Lu and Chiyu Ma and Ziqian Chen and Xue Wang and Bolin Ding and Jiancan Wu and Xiang Wang and Xiangnan He and Guoyin Wang and Jingren Zhou , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2603.22117 , eprinttype =. 2603.22117 , timestamp =

  22. [44]

    Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices , booktitle =

    Evan Zheran Liu and Aditi Raghunathan and Percy Liang and Chelsea Finn , editor =. Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices , booktitle =. 2021 , url =

  23. [49]

    2026 , eprint=

    AgentV-RL: Scaling Reward Modeling with Agentic Verifier , author=. 2026 , eprint=

  24. [52]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

    Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris

  25. [53]

    Claude code, 2025

    Anthropic. Claude code, 2025. URL [https://docs.anthropic.com/en/docs/claude-code](https://docs.anthropic.com/en/docs/claude-code)

  26. [54]

    TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

    Sikai Bai, Haoxi Li, Jie Zhang, Yongjiang Liu, and Song Guo. Ttvs: Boosting self-exploring reinforcement learning via test-time variational synthesis. arXiv preprint arXiv:2604.08468, 2026

  27. [55]

    Acebench: Who wins the match point in tool usage? arXiv preprint arXiv:2501.12851, 2025 a

    Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, et al. Acebench: Who wins the match point in tool usage? arXiv preprint arXiv:2501.12851, 2025 a

  28. [56]

    Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

    Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, and Wenji Mao. Flexible entropy control in RLVR with gradient-preserving perspective. CoRR, abs/2602.09782, 2026. doi:10.48550/ARXIV.2602.09782. URL https://doi.org/10.48550/arXiv.2602.09782

  29. [57]

    Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward

    Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward. CoRR, abs/2512.16912, 2025 b . doi:10.48550/ARXIV.2512.16912. URL https://doi.org/10.48550/arXiv.2512.16912

  30. [58]

    Beyond high-entropy exploration: Correctness-aware low-entropy segment-based advantage shaping for reasoning llms

    Xinzhu Chen, Xuesheng Li, Zhongxiang Sun, and Weijie Yu. Beyond high-entropy exploration: Correctness-aware low-entropy segment-based advantage shaping for reasoning llms. CoRR, abs/2512.00908, 2025 c . doi:10.48550/ARXIV.2512.00908. URL https://doi.org/10.48550/arXiv.2512.00908

  31. [59]

    Reasoning with exploration: An entropy perspective

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium...

  32. [60]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. CoRR, abs/2505.22617, 2025. doi:10.48550/ARXIV.2505.22617. URL https://doi...

  33. [61]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. CoRR, abs/2505.10978, 2025. doi:10.48550/ARXIV.2505.10978. URL https://doi.org/10.48550/arXiv.2505.10978

  34. [62]

    Soft Adaptive Policy Optimization

    Chang Gao, Chujie Zheng, Xionghui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. CoRR, abs/2511.20347, 2025. doi:10.48550/ARXIV.2511.20347. URL https://doi.org/10.48550/arXiv.2511.20347

  35. [63]

    Cruxeval: a benchmark for code reasoning, understanding and execution

    Alex Gu, Baptiste Rozi \`e re, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: a benchmark for code reasoning, understanding and execution. In Proceedings of the 41st International Conference on Machine Learning, pages 16568--16621, 2024

  36. [64]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    D Guo, D Yang, H Zhang, J Song, P Wang, Q Zhu, R Xu, R Zhang, S Ma, X Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025

  37. [65]

    Justrl: Scaling a 1.5 b llm with a simple rl recipe

    Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. Justrl: Scaling a 1.5 b llm with a simple rl recipe. arXiv preprint arXiv:2512.16649, 2025

  38. [66]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  39. [67]

    Beyond magnitude: Leveraging direction of RLVR updates for LLM reasoning

    Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou. Beyond magnitude: Leveraging direction of RLVR updates for LLM reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=r6Pw3RiMYL

  40. [68]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  41. [69]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13 0 (9): 0 9, 2024

  42. [70]

    Let's verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  43. [71]

    Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices

    Evan Zheran Liu, Aditi Raghunathan, Percy Liang, and Chelsea Finn. Decoupling exploration and exploitation for meta-reinforcement learning without sacrifices. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event , Proceedings of Machine Learning Research, p...

  44. [72]

    Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLM s

    Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, and Jingren Zhou. Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLM s. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=8vWIXno8LW

  45. [73]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

  46. [74]

    a henb \

    Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco F. Cusumano - Towner, Raja Giryes, and Philipp Kr \" a henb \" u hl. Entropy-preserving reinforcement learning. CoRR, abs/2603.11682, 2026. doi:10.48550/ARXIV.2603.11682. URL https://doi.org/10.48550/arXiv.2603.11682

  47. [75]

    arXiv preprint arXiv:2505.22660

    Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Maximizing confidence alone improves reasoning. CoRR, abs/2505.22660, 2025. doi:10.48550/ARXIV.2505.22660. URL https://doi.org/10.48550/arXiv.2505.22660

  48. [76]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  49. [77]

    On entropy control in LLM-RL algorithms

    Han Shen. On entropy control in LLM-RL algorithms. CoRR, abs/2509.03493, 2025. doi:10.48550/ARXIV.2509.03493. URL https://doi.org/10.48550/arXiv.2509.03493

  50. [78]

    CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

    Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. CE-GPPO: coordinating entropy via gradient-preserving clipping policy optimization in reinforcement learning. CoRR, abs/2509.20712, 2025. doi:10.48550/ARXIV.2509.20712. URL https://doi.org/10.48550/arXiv.2509.20712

  51. [79]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. IEEE Trans. Neural Networks , 9 0 (5): 0 1054--1054, 1998. doi:10.1109/TNN.1998.712192. URL https://doi.org/10.1109/TNN.1998.712192

  52. [80]

    Rethinking sample polarity in reinforcement learning with verifiable rewards

    Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Rethinking sample polarity in reinforcement learning with verifiable rewards. CoRR, abs/2512.21625, 2025. doi:10.48550/ARXIV.2512.21625. URL https://doi.org/10.48550/arXiv.2512.21625

  53. [81]

    Skip-Connected Policy Optimization for Implicit Advantage

    Fengwei Teng, Jinyi Bai, Xinhao Yao, Demi Ruohan Wang, Jiahao Zhao, and Zhijiang Guo. Skip-connected policy optimization for implicit advantage. arXiv preprint arXiv:2604.08690, 2026

  54. [82]

    Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In The Thirty-ninth Annual Conf...

  55. [83]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024

  56. [84]

    Agentgym-rl: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning

    Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, and Yu - Gang Jiang. Agentgym-rl: Training LLM agents for long-horizon decision making ...

  57. [85]

    Can RL improve generalization of LLM agents? an empirical study

    Zhiheng Xi, Xin Guo, Jiaqi Liu, Jiazheng Zhang, Yutao Fan, Zhihao Zhang, Shichun Liu, Mingxu Chai, Xiaowei Shi, Yitao Zhai, Xunliang Cai, Tao Gui, Qi Zhang, and Xuanjing Huang. Can RL improve generalization of LLM agents? an empirical study. CoRR, abs/2603.12011, 2026 a . doi:10.48550/ARXIV.2603.12011. URL https://doi.org/10.48550/arXiv.2603.12011

  58. [86]

    BAPO : Stabilizing off-policy reinforcement learning for LLM s via balanced policy optimization with adaptive clipping

    Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Xun Deng, Zhihao Zhang, Honglin Guo, Zhikai Lei, Miao Zheng, Guoteng Wang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, and Xuanjing Huang. BAPO : Stabilizing off-policy reinforcement learning for LLM s via balanced policy optimization with adaptive clippin...

  59. [87]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

  60. [88]

    DAPO : An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  61. [90]

    AgentV-RL: Scaling Reward Modeling with Agentic Verifier

    Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, et al. Agentv-rl: Scaling reward modeling with agentic verifier. arXiv preprint arXiv:2604.16004, 2026 b

  62. [91]

    A survey of reinforcement learning for large reasoning models

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua...

  63. [92]

    Nemotron-research-tool-n1: Exploring tool-using language models with reinforced reasoning

    Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. Nemotron-research-tool-n1: Exploring tool-using language models with reinforced reasoning. arXiv preprint arXiv:2505.00024, 2025 b

  64. [93]

    American invitational mathematics examination (aime) 2024, 2024

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024, 2024. Contest problem collection

  65. [94]

    Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective

    Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, et al. Why reinforcement fine-tuning enables mllms preserve prior knowledge better: A data perspective. arXiv preprint arXiv:2506.23508, 2025 c

  66. [95]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  67. [96]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023