Holder Policy Optimisation

Chenyang Le; Dingli Liang; Jiachen Zhu; Jianghao Lin; Jun Wang; Lingyu Yang; Weinan Zhang; Yihang Chen; Yuxiang Chen; Zhaokai Wang

arxiv: 2605.12058 · v2 · pith:5T3JCLYXnew · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang

Pith reviewed 2026-05-22 09:57 UTC · model grok-4.3

keywords policy optimization · Hölder mean · token aggregation · reinforcement learning · large language models · gradient variance · training stability · mathematical reasoning

The pith

A tunable exponent in the Hölder mean unifies token aggregation for group relative policy optimization and supplies continuous control over gradient concentration versus variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Group relative policy optimization estimates advantages from multiple sampled trajectories but must still combine token-level probabilities inside each sequence. Fixed aggregation choices create a recurring problem in which some options trigger training collapse while others produce only mediocre final performance. The paper replaces those fixed choices with the Hölder mean and shows that its single exponent p can be varied to move continuously between concentrating gradients on sparse signals and strictly limiting their variance. Because any single fixed p leaves part of the trade-off unresolved, the authors add a schedule that gradually changes p across the entire training run. The resulting method improves stability and reaches higher accuracy on mathematical reasoning tasks and higher success rates on agent environments.

Core claim

Token-level probability aggregation inside GRPO can be performed by the Hölder mean, whose exponent p directly governs the concentration-stability trade-off: larger p concentrates the gradient to strengthen learning from infrequent high-signal tokens, while smaller p bounds gradient variance to prevent collapse. Because no static p works for the whole training process, a dynamic annealing schedule that lowers p over time produces measurably better convergence. The evidence is a 54.9 percent average accuracy across mathematical benchmarks (a 7.2 percent relative gain over standard GRPO) and a 93.8 percent success rate on ALFWorld.
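It may help to see what the relative gain implies in absolute terms. Assuming the 7.2 percent figure is measured relative to the GRPO average (as the wording states) and that both averages cover the same benchmarks, the GRPO baseline can be back-computed. The rounded values below are derived here for illustration and are not reported on this page.

```latex
\text{GRPO average} \approx \frac{54.9}{1 + 0.072} \approx 51.2\%,
\qquad
54.9\% - 51.2\% \approx 3.7 \text{ percentage points}
```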

What carries the argument

The Hölder mean applied to token probabilities inside each trajectory, with the exponent p serving as the single continuous knob that trades gradient concentration against variance bounds.
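As a rough illustration of this knob, the sketch below computes a Hölder (power) mean of order p over a list of positive token-level ratios, treating p = 0 as the geometric-mean limit. The function name `holder_mean` and the example ratios are hypothetical; this is a minimal sketch of the standard power mean, not the authors' implementation.

```python
import math

def holder_mean(ratios, p):
    """Power mean of order p over positive values (hypothetical helper)."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty sequence of positive numbers")
    n = len(ratios)
    if abs(p) < 1e-12:
        # p -> 0 limit: geometric mean, computed in log space
        return math.exp(sum(math.log(r) for r in ratios) / n)
    # Evaluate ((1/n) * sum r^p)^(1/p) via log-sum-exp to avoid overflow
    logs = [p * math.log(r) for r in ratios]
    m = max(logs)
    log_mean = m + math.log(sum(math.exp(v - m) for v in logs) / n)
    return math.exp(log_mean / p)

# Illustrative ratios only; not taken from the paper
example_ratios = [0.5, 1.0, 2.0]
arithmetic = holder_mean(example_ratios, 1.0)   # arithmetic mean
geometric = holder_mean(example_ratios, 0.0)    # geometric mean
```

The power mean is non-decreasing in p, so raising p pulls the aggregate toward the largest ratio and lowering it pulls the aggregate toward the smallest.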

If this is right

Larger p values concentrate gradients on the tokens with the largest importance ratios, amplifying sparse learning signals.
Smaller p values tighten upper bounds on gradient variance, reducing the chance of training collapse.
A schedule that begins with higher p and lowers it over time first encourages signal amplification then enforces stability.
The combined framework yields a 54.9 percent average accuracy on multiple mathematical benchmarks, a 7.2 percent relative improvement over standard GRPO.
The same approach reaches a 93.8 percent success rate on the ALFWorld environment.
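The schedule described in the points above can be sketched as a simple linear interpolation. The figure captions on this page mention a linearly decaying schedule p: 2→−2; the per-step parameterisation, the inclusive endpoints, and the name `linear_p_schedule` are assumptions made for illustration.

```python
def linear_p_schedule(step, total_steps, p_start=2.0, p_end=-2.0):
    """Linearly anneal the Hölder exponent from p_start to p_end.

    Hypothetical helper: `step` is 0-indexed and the final step
    (total_steps - 1) returns exactly p_end.
    """
    if total_steps <= 1:
        return p_end
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)

# Exponents at the start, midpoint, and end of a 101-step run
p_values = [linear_p_schedule(s, 101) for s in (0, 50, 100)]
```

Under these assumptions the exponent crosses p = 0 at the midpoint of training, so an aggregation routine paired with this schedule needs to handle the geometric-mean limit.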

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tunable-mean idea could be tested in other sequence-level reinforcement learning settings that currently rely on mean or max aggregation.
Optimal annealing schedules might turn out to depend on model scale or task difficulty, offering a new hyper-parameter dimension to explore.
If similar concentration-variance tensions appear in value-function estimation or advantage normalization, replacing fixed rules with scheduled Hölder means could be a general pattern.

Load-bearing premise

That the observed instability and performance limits arise chiefly from the choice of a fixed aggregation rule and can be corrected by varying one exponent in the mean without introducing new failure modes or discarding useful information.

What would settle it

An experiment in which the dynamic annealing schedule for p produces final accuracy or success rates no higher than the strongest fixed-aggregation baseline on the same mathematical and ALFWorld evaluations.

Figures

Figures reproduced from arXiv: 2605.12058 by Chenyang Le, Dingli Liang, Jiachen Zhu, Jianghao Lin, Jun Wang, Lingyu Yang, Weinan Zhang, Yihang Chen, Yuxiang Chen, Zhaokai Wang, Ziqin Gong.

**Figure 1.** HölderPO unifies token-level aggregation under a single parameter p. The objective at the top generalises GRPO by replacing its arithmetic mean over token-level importance ratios with the Hölder mean of order p ∈ R, recovering GRPO (p = 1) and GMPO/GSPO (p → 0) as special cases. The bar chart reports accuracy on AIME24 (blue, sparse signal) and MATH500 (red, dense signal), with dashed lines marking GRPO ba…

**Figure 2.** Token-level importance ratio log ρt(θ) during training. Left and Right track the per-step upper and lower envelopes respectively. As p decreases, the upper envelope drops and the lower envelope rises, tightening the gap monotonically. Our decaying schedule p: 2→−2 (solid green) thus enables aggressive updates in the early stage and progressively converges to stable optimization in the later stage. Constant…

**Figure 3.** Entropy and gradient-norm dynamics under different Hölder exponents p. Columns: Math (Qwen2.5-Math-7B on MATH-12k) and Alfworld (Qwen2.5-1.5B). Rows: per-step policy entropy and gradient norm ∥∇L∥ (log scale on Math, linear on Alfworld). Constant-p baselines (p ∈ {+2, 0, −2}, dashed/dotted/dash-dotted) are compared with our linearly-decaying schedule p: 2→−2 (solid green). Positive p concentrates mass on h…

Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO, and secures an exceptional 93.8% success rate on ALFWorld.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HölderPO generalizes token aggregation in GRPO via the Hölder mean with a dynamic p schedule, delivering measurable gains on math and agent tasks but resting on an untested claim that dynamism beats the best fixed p.

HölderPO generalizes how token-level probabilities get aggregated inside GRPO-style updates by using the Hölder mean instead of a single fixed function. The exponent p becomes a continuous control that trades off gradient concentration against variance, and the authors add a dynamic annealing schedule that changes p over the course of training. That is the actual novelty here; prior GRPO work picked one aggregator and stuck with it.

Referee Report

2 major / 2 minor

Summary. The paper introduces HölderPO, a generalized policy optimization framework for LLMs that replaces fixed aggregation of token-level probabilities in GRPO with the Hölder mean. Modulating the exponent p is claimed to continuously control the trade-off between gradient concentration (larger p) and variance bounds (smaller p), with theoretical proofs provided for these properties. Since no fixed p resolves the trade-off universally, a dynamic annealing schedule for p is proposed and evaluated, yielding 54.9% average accuracy on mathematical benchmarks (7.2% relative gain over GRPO) and 93.8% success on ALFWorld.

Significance. If the theoretical bounds and the necessity of the dynamic schedule hold, the framework provides a flexible, theoretically grounded mechanism for balancing stability and signal strength in group-based policy optimization. The empirical gains are substantial and could influence RL methods for LLMs, particularly if the approach generalizes beyond the reported tasks and the annealing schedule demonstrably outperforms exhaustively tuned static p values.

major comments (2)

The abstract asserts that 'no static configuration can universally resolve this concentration-stability trade-off' and motivates the dynamic annealing algorithm on this basis, yet no ablation is described comparing the dynamic schedule against the best fixed p found via grid search over the same range and benchmarks. This leaves open whether the reported 54.9% accuracy and 93.8% ALFWorld success are driven by dynamism itself or could be matched by a single well-chosen static p.
Theoretical claims of concentration for large p and strict variance bounds for small p are stated, but the manuscript provides no derivation steps, intermediate lemmas, or explicit assumptions (e.g., on the distribution of advantages or trajectory lengths) that would allow verification of the bounds' tightness or applicability to the GRPO setting.

minor comments (2)

The reported accuracy figures lack error bars, number of random seeds, or ablation details on hyperparameter sensitivity, which weakens confidence in the stability claims.
Notation for the Hölder mean and its relation to token-level probability aggregation should be introduced with an explicit equation early in the paper for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for strengthening the presentation of both the empirical and theoretical contributions. We address each major comment below and will incorporate revisions to improve clarity and completeness.

Referee: The abstract asserts that 'no static configuration can universally resolve this concentration-stability trade-off' and motivates the dynamic annealing algorithm on this basis, yet no ablation is described comparing the dynamic schedule against the best fixed p found via grid search over the same range and benchmarks. This leaves open whether the reported 54.9% accuracy and 93.8% ALFWorld success are driven by dynamism itself or could be matched by a single well-chosen static p.

Authors: We agree that an explicit ablation comparing the dynamic annealing schedule to the best fixed p (identified via grid search over the same range and benchmarks) would provide stronger evidence for the necessity of dynamism. The current manuscript relies on empirical observations that certain fixed p values lead to collapse while others yield limited gains, but does not report a comprehensive grid-search comparison. We will add this ablation study in the revised version, including results for multiple fixed p values and a direct comparison to the annealing schedule on the mathematical benchmarks and ALFWorld. (revision: yes)
Referee: Theoretical claims of concentration for large p and strict variance bounds for small p are stated, but the manuscript provides no derivation steps, intermediate lemmas, or explicit assumptions (e.g., on the distribution of advantages or trajectory lengths) that would allow verification of the bounds' tightness or applicability to the GRPO setting.

Authors: The theoretical properties are derived in Section 3 using the Hölder mean applied to token-level probabilities within the GRPO gradient estimator. However, we acknowledge that the main text presents the results at a high level without full intermediate steps or explicit assumptions. We will expand the appendix with complete derivation details, including all lemmas, the precise assumptions on advantage distributions and trajectory lengths, and discussion of bound tightness to enable verification in the GRPO setting. (revision: yes)

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent theoretical proofs and empirical observations

The paper's core chain begins with empirical observation of trade-offs in fixed aggregations, followed by a theoretical proof that larger or smaller p controls gradient concentration versus variance bounds via the Hölder mean. The dynamic annealing schedule is then instantiated as a practical response to the stated limitation of static p. These steps do not reduce by construction to fitted inputs or self-definitions; the proofs are presented as first-principles results, and performance claims are benchmarked externally rather than tautologically derived from the schedule itself. No load-bearing self-citation or smuggled ansatz is evident in the provided derivation outline.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the mathematical properties of the Hölder mean for bounding gradients and on the empirical observation that fixed aggregations produce a collapse-versus-performance trade-off; no new physical entities are postulated.

free parameters (1)

p (Hölder exponent)
The exponent is modulated continuously and annealed across training; its specific schedule is chosen to balance concentration and variance and is therefore a free parameter of the method.

axioms (1)

standard math: The Hölder mean provides a continuous family of aggregation functions whose gradient concentration and variance can be strictly bounded by the choice of p.
Invoked when the paper states that larger p concentrates the gradient and smaller p bounds variance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: HölderPO generalises the token-level aggregation by the Hölder mean of order p: ρ_{i,p}(θ) = ((1/|y_i|) ∑_t r_{i,t}^p)^{1/p} (p≠0) ... p=1 recovers GRPO, p=0 recovers GSPO

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Theorem 1: Shannon entropy of W_p attains global maximum at p=0 and strictly decreases as |p| increases; p→±∞ concentrates on argmax/argmin ratios
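
The behaviour quoted from Theorem 1 can be checked numerically under one assumption: that W_p denotes the normalised weights w_t = r_t^p / ∑_s r_s^p, which is what differentiating the log of a power mean with respect to log r_t yields. This page does not define W_p, so the definition, the helper names, and the example ratios below are all hypothetical.

```python
import math

def holder_weights(ratios, p):
    """Assumed weights w_t = r_t^p / sum_s r_s^p, i.e. a softmax over p * log r_t."""
    logits = [p * math.log(r) for r in ratios]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # shift by the max for stability
    z = sum(exps)
    return [e / z for e in exps]

def shannon_entropy(weights):
    """Shannon entropy in nats, skipping zero-weight entries."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Illustrative ratios only; not taken from the paper
example_ratios = [0.5, 1.0, 2.0, 4.0]
entropy_by_p = {
    p: shannon_entropy(holder_weights(example_ratios, p))
    for p in (-4.0, -1.0, 0.0, 1.0, 4.0)
}
```

With these weights, p = 0 gives the uniform distribution (entropy log 4 for four tokens), and the entropy shrinks as |p| grows in either direction, with the mass moving to the largest ratio for positive p and the smallest for negative p.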

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Group-in-Group Policy Optimization for LLM Agent Training

Group-in-group policy optimization for llm agent training , author=. arXiv preprint arXiv:2505.10978 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Advances in neural information processing systems , volume=

Learning to summarize with human feedback , author=. Advances in neural information processing systems , volume=

work page

[19] [19]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms , author=. arXiv preprint arXiv:2506.14245 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

arXiv preprint arXiv:2504.02546 , year=

Gpg: A simple and strong reinforcement learning baseline for model reasoning , author=. arXiv preprint arXiv:2504.02546 , year=

work page arXiv

[24] [24]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin

AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Momentum , author=. arXiv preprint arXiv:2505.14264 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

arXiv preprint arXiv:2506.02864 , year=

Bnpo: Beta normalization policy optimization , author=. arXiv preprint arXiv:2506.02864 , year=

work page arXiv

[27] [27]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model , author=. arXiv preprint arXiv:2503.24290 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Process Reinforcement through Implicit Rewards

Process reinforcement through implicit rewards , author=. arXiv preprint arXiv:2502.01456 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

What’s behind ppo’s collapse in long-cot? value optimization holds the secret.arXiv preprint arXiv:2503.01491, 2025

What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret , author=. arXiv preprint arXiv:2503.01491 , year=

work page arXiv

[30] [30]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Token-level proximal policy optimization for query generation , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025

[31] [31]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild , author=. arXiv preprint arXiv:2503.18892 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

arXiv preprint arXiv:2404.02078 , year=

Advancing llm reasoning generalists with preference trees , author=. arXiv preprint arXiv:2404.02078 , year=

work page arXiv

[33] [33]

SIAM review , volume=

Optimization methods for large-scale machine learning , author=. SIAM review , volume=. 2018 , publisher=

work page 2018

[34] [34]

International conference on machine learning , pages=

Trust region policy optimization , author=. International conference on machine learning , pages=. 2015 , organization=

work page 2015

[35] [35]

1976 , publisher=

Principles of Mathematical Analysis , author=. 1976 , publisher=

work page 1976

[36] [36]

arXiv preprint arXiv:2601.22521 , year=

One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry , author=. arXiv preprint arXiv:2601.22521 , year=

work page arXiv

[37] [37]

ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models , author=. arXiv preprint arXiv:2603.28204 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

arXiv preprint arXiv:2508.03772 , year=

Gtpo: Stabilizing group relative policy optimization via gradient and entropy control , author=. arXiv preprint arXiv:2508.03772 , year=

work page arXiv

[39] [39]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning , author=. arXiv preprint arXiv:2506.01939 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

arXiv preprint arXiv:2506.08440 , year=

Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization , author=. arXiv preprint arXiv:2506.08440 , year=

work page arXiv

[41] [41]

arXiv preprint arXiv:2505.12929 , year=

Do not let low-probability tokens over-dominate in rl for llms , author=. arXiv preprint arXiv:2505.12929 , year=

work page arXiv

[42] [42]

arXiv preprint arXiv:2510.03669 , year=

Token hidden reward: Steering exploration-exploitation in group relative deep reinforcement learning , author=. arXiv preprint arXiv:2510.03669 , year=

work page arXiv

[43] [43]

arXiv preprint arXiv:2510.09369 , year=

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood , author=. arXiv preprint arXiv:2510.09369 , year=

work page arXiv

[44] [44]

Advances in Neural Information Processing Systems , volume=

Lima: Less is more for alignment , author=. Advances in Neural Information Processing Systems , volume=

work page

[45] [45]

The Twelfth International Conference on Learning Representations , year=

Rain: Your language models can align themselves without finetuning , author=. The Twelfth International Conference on Learning Representations , year=

work page

[46] [46]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in

Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Song, Shiji and Huang, Gao , booktitle=. Does Reinforcement Learning Really Incentivize Reasoning Capacity in

work page

[47] [47]

Rewarding the Unlikely: Lifting

He, Andre Wang and Fried, Daniel and Welleck, Sean , booktitle=. Rewarding the Unlikely: Lifting. 2025 , publisher=

work page 2025

[48] [48]

Advances in Neural Information Processing Systems , volume=

Star: Bootstrapping reasoning with reasoning , author=. Advances in Neural Information Processing Systems , volume=

work page

[49] [49]

Let's Verify Step by Step

Let's verify step by step , author=. arXiv preprint arXiv:2305.20050 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

Advances in Neural Information Processing Systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=

work page

[51] [51]

arXiv preprint arXiv:2510.06870 , year=

lambda -GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences , author=. arXiv preprint arXiv:2510.06870 , year=

work page arXiv

[52] [52]

arXiv preprint arXiv:2505.23585 , year=

On-policy rl with optimal reward baseline , author=. arXiv preprint arXiv:2505.23585 , year=

work page arXiv

[53] [53]

arXiv preprint arXiv:2505.12346 , year=

Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization , author=. arXiv preprint arXiv:2505.12346 , year=

work page arXiv

[54] [54]

2018 , publisher=

Reinforcement learning: An introduction , author=. 2018 , publisher=

work page 2018

[55] [55]

International conference on machine learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018

[56] [56]

Spurious Rewards: Rethinking Training Signals in RLVR

Spurious Rewards: Rethinking Training Signals in RLVR , author=. arXiv preprint arXiv:2506.10947 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

2018 , publisher=

High-dimensional probability: An introduction with applications in data science , author=. 2018 , publisher=

work page 2018

[58] [58]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

Transformer Feed-Forward Layers Are Key-Value Memories , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2021

[59] [59]

Transformer Circuits Thread , year=

Towards monosemanticity: Decomposing language models with dictionary learning , author=. Transformer Circuits Thread , year=

work page