Hölder Policy Optimisation
Pith reviewed 2026-05-22 09:57 UTC · model grok-4.3
The pith
A tunable exponent in the Hölder mean unifies token aggregation for group relative policy optimization and supplies continuous control over gradient concentration versus variance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Token-level probability aggregation inside GRPO can be performed by the Hölder mean, whose exponent p directly governs the concentration-stability trade-off. Larger p concentrates the gradient to strengthen learning from infrequent high-signal tokens, while smaller p bounds gradient variance to prevent collapse. Because no static p works for the whole training process, the paper uses a dynamic annealing schedule that lowers p over time and reports better convergence: a 54.9 percent average accuracy across mathematical benchmarks (a 7.2 percent relative gain over standard GRPO) and a 93.8 percent success rate on ALFWorld.
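As a rough check on how the two headline numbers relate, the arithmetic below assumes that "relative gain" means (new - baseline) / baseline. The page does not state the GRPO baseline directly, so the baseline figure here is an implied value under that assumption, not a reported one.

```latex
\text{baseline}_{\text{GRPO}} \approx \frac{54.9}{1 + 0.072} \approx 51.2\%,
\qquad
54.9 - 51.2 \approx 3.7 \text{ percentage points (absolute)}.
```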
What carries the argument
The Hölder mean applied to token probabilities inside each trajectory, with the exponent p serving as the single continuous knob that trades gradient concentration against variance bounds.
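To make the single knob concrete, here is a minimal sketch of a Hölder (power) mean over per-token values. The function name `holder_mean`, the `eps` threshold, and the sample ratios are all hypothetical and not taken from the paper. It relies only on standard facts: p = 1 gives the arithmetic mean, and the limit p → 0 gives the geometric mean.

```python
import numpy as np

def holder_mean(ratios, p, eps=1e-8):
    """Power mean of strictly positive values; |p| < eps uses the geometric-mean limit."""
    r = np.asarray(ratios, dtype=np.float64)
    if np.any(r <= 0):
        raise ValueError("ratios must be strictly positive")
    log_r = np.log(r)
    if abs(p) < eps:
        # Limit p -> 0: exp of the mean log, i.e. the geometric mean.
        return float(np.exp(log_r.mean()))
    # Evaluate (mean(r**p))**(1/p) in log space to avoid overflow for large |p|.
    z = p * log_r
    m = z.max()
    log_mean = m + np.log(np.mean(np.exp(z - m)))
    return float(np.exp(log_mean / p))

# Hypothetical per-token ratios for one trajectory (illustrative values only).
token_ratios = [0.8, 1.0, 1.5, 0.9]
arith = holder_mean(token_ratios, 1.0)   # arithmetic mean
geom = holder_mean(token_ratios, 0.0)    # geometric mean
near_max = holder_mean(token_ratios, 200.0)  # approaches the largest ratio
```

The power mean never decreases as p increases, so sweeping p moves the aggregate continuously from the smallest value (p → -∞) to the largest (p → +∞).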
If this is right
- Larger p values concentrate gradients on a small number of high-signal tokens (in the limit, those with the largest token-level ratios), amplifying sparse learning signals.
- Smaller p values tighten upper bounds on gradient variance, reducing the chance of training collapse.
- A schedule that begins with higher p and lowers it over time first encourages signal amplification then enforces stability.
- The combined framework yields a 54.9 percent average accuracy on multiple mathematical benchmarks, a 7.2 percent relative improvement over standard GRPO.
- The same approach reaches a 93.8 percent success rate on the ALFWorld environment.
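The schedule idea in the list above can be sketched as follows. The page only says that p starts higher and is lowered over training. The linear shape, the endpoints `p_start=2.0` and `p_end=0.0`, and the name `anneal_p` are assumptions chosen for illustration, not the paper's actual schedule.

```python
def anneal_p(step, total_steps, p_start=2.0, p_end=0.0):
    """Linearly lower p from p_start to p_end (illustrative schedule, not the paper's)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end keep p at p_end.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac

# Early steps use a larger p (more concentration); later steps use a smaller p.
schedule = [anneal_p(s, 100) for s in (0, 25, 50, 100)]
```

Other monotone shapes (cosine, stepwise) would fit the same description. Which one works best is an empirical question that the page does not settle.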
Where Pith is reading between the lines
- The same tunable-mean idea could be tested in other sequence-level reinforcement learning settings that currently rely on mean or max aggregation.
- Optimal annealing schedules might turn out to depend on model scale or task difficulty, offering a new hyper-parameter dimension to explore.
- If similar concentration-variance tensions appear in value-function estimation or advantage normalization, replacing fixed rules with scheduled Hölder means could be a general pattern.
Load-bearing premise
That the observed instability and performance limits arise chiefly from the choice of a fixed aggregation rule and can be corrected by varying one exponent in the mean without introducing new failure modes or discarding useful information.
What would settle it
An experiment in which the dynamic annealing schedule for p produces final accuracy or success rates no higher than the strongest fixed-aggregation baseline on the same mathematical and ALFWorld evaluations.
Original abstract
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger p concentrates the gradient to amplify sparse learning signals, whereas a smaller p strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules p across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO, and secures an exceptional 93.8% success rate on ALFWorld.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HölderPO, a generalized policy optimization framework for LLMs that replaces fixed aggregation of token-level probabilities in GRPO with the Hölder mean. Modulating the exponent p is claimed to continuously control the trade-off between gradient concentration (larger p) and variance bounds (smaller p), with theoretical proofs provided for these properties. Since no fixed p resolves the trade-off universally, a dynamic annealing schedule for p is proposed and evaluated, yielding 54.9% average accuracy on mathematical benchmarks (7.2% relative gain over GRPO) and 93.8% success on ALFWorld.
Significance. If the theoretical bounds and the necessity of the dynamic schedule hold, the framework provides a flexible, theoretically grounded mechanism for balancing stability and signal strength in group-based policy optimization. The empirical gains are substantial and could influence RL methods for LLMs, particularly if the approach generalizes beyond the reported tasks and the annealing schedule demonstrably outperforms exhaustively tuned static p values.
major comments (2)
- The abstract asserts that 'no static configuration can universally resolve this concentration-stability trade-off' and motivates the dynamic annealing algorithm on this basis, yet no ablation is described comparing the dynamic schedule against the best fixed p found via grid search over the same range and benchmarks. This leaves open whether the reported 54.9% accuracy and 93.8% ALFWorld success are driven by dynamism itself or could be matched by a single well-chosen static p.
- Theoretical claims of concentration for large p and strict variance bounds for small p are stated, but the manuscript provides no derivation steps, intermediate lemmas, or explicit assumptions (e.g., on the distribution of advantages or trajectory lengths) that would allow verification of the bounds' tightness or applicability to the GRPO setting.
minor comments (2)
- The reported accuracy figures lack error bars, number of random seeds, or ablation details on hyperparameter sensitivity, which weakens confidence in the stability claims.
- Notation for the Hölder mean and its relation to token-level probability aggregation should be introduced with an explicit equation early in the paper for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects for strengthening the presentation of both the empirical and theoretical contributions. We address each major comment below and will incorporate revisions to improve clarity and completeness.
- Referee: The abstract asserts that 'no static configuration can universally resolve this concentration-stability trade-off' and motivates the dynamic annealing algorithm on this basis, yet no ablation is described comparing the dynamic schedule against the best fixed p found via grid search over the same range and benchmarks. This leaves open whether the reported 54.9% accuracy and 93.8% ALFWorld success are driven by dynamism itself or could be matched by a single well-chosen static p.
  Authors: We agree that an explicit ablation comparing the dynamic annealing schedule to the best fixed p (identified via grid search over the same range and benchmarks) would provide stronger evidence for the necessity of dynamism. The current manuscript relies on empirical observations that certain fixed p values lead to collapse while others yield limited gains, but does not report a comprehensive grid-search comparison. We will add this ablation study in the revised version, including results for multiple fixed p values and a direct comparison to the annealing schedule on the mathematical benchmarks and ALFWorld.
  Revision: yes
- Referee: Theoretical claims of concentration for large p and strict variance bounds for small p are stated, but the manuscript provides no derivation steps, intermediate lemmas, or explicit assumptions (e.g., on the distribution of advantages or trajectory lengths) that would allow verification of the bounds' tightness or applicability to the GRPO setting.
  Authors: The theoretical properties are derived in Section 3 using the Hölder mean applied to token-level probabilities within the GRPO gradient estimator. However, we acknowledge that the main text presents the results at a high level without full intermediate steps or explicit assumptions. We will expand the appendix with complete derivation details, including all lemmas, the precise assumptions on advantage distributions and trajectory lengths, and discussion of bound tightness to enable verification in the GRPO setting.
  Revision: yes
Circularity Check
No significant circularity; derivation relies on independent theoretical proofs and empirical observations
The paper's core chain begins with empirical observation of trade-offs in fixed aggregations, followed by a theoretical proof that larger/smaller p controls gradient concentration vs. variance bounds via the Hölder mean. The dynamic annealing schedule is then instantiated as a practical response to the stated limitation of static p. These steps do not reduce by construction to fitted inputs or self-definitions; the proofs are presented as first-principles results, and performance claims are benchmarked externally rather than tautologically derived from the schedule itself. No load-bearing self-citation or smuggled ansatz is evident in the provided derivation outline.
Axiom & Free-Parameter Ledger
free parameters (1)
- p (Hölder exponent)
axioms (1)
- standard math: The Hölder mean provides a continuous family of aggregation functions whose gradient concentration and variance can be strictly bounded by the choice of p.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem:
  HölderPO generalises the token-level aggregation by the Hölder mean of order p: $\rho_{i,p}(\theta) = \left(\frac{1}{|y_i|} \sum_{t} r_{i,t}^{p}\right)^{1/p}$ ($p \neq 0$) ... p=1 recovers GRPO, p=0 recovers GSPO
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem:
  Theorem 1: Shannon entropy of W_p attains global maximum at p=0 and strictly decreases as |p| increases; p→±∞ concentrates on argmax/argmin ratios
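The entropy statement quoted in Theorem 1 can be checked numerically on a toy example. This is a sketch that assumes W_p denotes the normalised weights w_t = r_t^p / Σ_s r_s^p. That form is what you get by differentiating the log of the Hölder mean with respect to log r_t, but the page does not confirm it is the paper's definition. The helper names and the sample ratios are hypothetical.

```python
import numpy as np

def holder_weights(ratios, p):
    """Assumed form of W_p: weights proportional to r_t**p, normalised to sum to 1."""
    r = np.asarray(ratios, dtype=np.float64)
    z = p * np.log(r)
    z -= z.max()  # shift for numerical stability; does not change the normalised weights
    w = np.exp(z)
    return w / w.sum()

def shannon_entropy(w):
    """Shannon entropy in nats, skipping zero-weight entries."""
    w = np.asarray(w, dtype=np.float64)
    nz = w[w > 0]
    return float(-(nz * np.log(nz)).sum())

# Hypothetical per-token ratios (illustrative values only).
ratios = [0.8, 1.0, 1.5, 0.9]
entropies = {p: shannon_entropy(holder_weights(ratios, p)) for p in (-8, -2, 0, 2, 8)}
```

Under this assumed definition the weights are uniform at p = 0, which gives the maximum entropy log n. As |p| grows, the mass shifts toward the largest ratio (p > 0) or the smallest ratio (p < 0), consistent with the quoted argmax/argmin behaviour.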
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.