Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Clive Bai; Heming Zou; Kai Yang; Lizhou Cai; Qi Wang; Saiyong Yang; Weijie Liu; Wutong Xu; Xiangyang Ji; Yangkun Chen

arxiv: 2605.06139 · v2 · pith:BCAIFHOPnew · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

Pith reviewed 2026-05-21 08:58 UTC · model grok-4.3

keywords: Listwise Policy Optimization; Reinforcement Learning with Verifiable Rewards; Group-based Policy Gradient; Response Simplex; Target Projection; LLM Post-training; Policy Optimization

The pith

Group-based policy gradients for LLM reasoning implicitly project toward a target on the response simplex, and making this explicit yields monotonic listwise improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that prevalent group-based policy gradient methods in reinforcement learning with verifiable rewards all share a hidden geometric pattern: each one selects an implicit target distribution over possible responses to a prompt and approximates movement toward that target with a first-order update. Listwise Policy Optimization makes this structure explicit by confining the proximal objective to the simplex of response probabilities and then minimizing a chosen divergence exactly rather than approximately. The separation produces updates that steadily raise the listwise objective while keeping gradients bounded, zero-sum, and self-correcting. Experiments across reasoning tasks and model sizes show that the resulting method outperforms matched baselines while retaining response variety and training stability.

Core claim

Existing group-based policy gradients in RLVR each implicitly define a target distribution on the response simplex and project the policy toward it via first-order approximation. Listwise Policy Optimization instead restricts the proximal RL objective to the response simplex and performs the projection through exact divergence minimization, which demystifies the target and supplies monotonic improvement on the listwise objective together with bounded, zero-sum, self-correcting gradients and flexible divergence choice.
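
As a rough illustration of what an explicit target on the response simplex can look like, the sketch below assumes the KL-regularized form that this page quotes from the paper (maximize Σ w_k R_k − τ·KL(w‖P_t) over the simplex), whose maximizer is proportional to P_t[k]·exp(R_k/τ). The function names, the example rewards, and the reading of the score offset as log P_t[k] are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_regularized_target(rewards, old_probs, tau):
    """Hypothetical helper: target w*_k proportional to P_t[k] * exp(R_k / tau).

    Assumption: the offset added to R_k / tau is log P_t[k].
    """
    scores = [r / tau + math.log(p) for r, p in zip(rewards, old_probs)]
    return softmax(scores)

rewards = [1.0, 0.0, 0.0, 1.0]      # verifiable 0/1 rewards for K = 4 responses
old_probs = [0.1, 0.4, 0.3, 0.2]    # current policy restricted to the group
target = kl_regularized_target(rewards, old_probs, tau=1.0)
```

In this toy group the target shifts mass toward the two rewarded responses while staying anchored to the current policy; a smaller `tau` would sharpen the shift.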

What carries the argument

The decoupled projection step on the response simplex, which separates target definition from the policy update and replaces first-order approximation with exact divergence minimization.
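
One plausible reading of "restricting to the response simplex" is to renormalize the policy's sequence probabilities over the K sampled responses. The sketch below follows that reading; the helper names and the toy token log-probabilities are hypothetical, and the paper's exact construction may differ.

```python
import math

def sequence_logprob(token_logprobs):
    """Log-probability of a whole response as the sum of its token log-probs."""
    return sum(token_logprobs)

def restrict_to_simplex(sequence_logprobs):
    """Renormalize sequence probabilities over the sampled group (assumed construction)."""
    m = max(sequence_logprobs)
    weights = [math.exp(lp - m) for lp in sequence_logprobs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy group of K = 3 responses with different lengths (made-up numbers).
group_token_logprobs = [
    [-0.2, -0.5, -0.1],
    [-1.0, -0.3],
    [-0.7, -0.9, -0.4, -0.2],
]
seq_lps = [sequence_logprob(t) for t in group_token_logprobs]
probs = restrict_to_simplex(seq_lps)
```

The resulting `probs` is a point on the (K−1)-simplex, which is the object a listwise target and divergence would then act on.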

If this is right

The listwise objective improves monotonically under the projected updates.
Projection gradients remain bounded, sum to zero, and self-correct over steps.
Different divergence functions can be substituted in the projection step, each carrying distinct structural properties.
Performance on reasoning benchmarks rises relative to standard group-based baselines while response diversity and optimization stability are preserved.
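
The bounded, zero-sum, and self-correcting properties listed above can be checked on a toy case. The sketch assumes a forward-KL projection with the policy parameterized by one logit per sampled response, for which the gradient with respect to the logits is the standard `p - target`. Whether this matches the paper's parameterization is an assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl_logit_gradient(logits, target):
    """Gradient of KL(target || softmax(logits)) with respect to the logits."""
    probs = softmax(logits)
    return [p - w for p, w in zip(probs, target)]

logits = [0.0, 1.0, -0.5, 0.3]          # made-up logits for K = 4 responses
target = [0.4, 0.1, 0.1, 0.4]           # made-up target on the simplex
grad = forward_kl_logit_gradient(logits, target)

# At the target itself the gradient vanishes, so the update stops on its own.
grad_at_target = forward_kl_logit_gradient([math.log(w) for w in target], target)
```

Each component is a difference of two probabilities, so it lies in [-1, 1]; the components sum to zero because both vectors sum to one.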

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The geometric framing could be used to construct new targets that address specific failure modes such as repetitive or off-topic reasoning steps.
The same simplex-projection idea might apply to other sequential generation settings beyond language, such as code or structured output.
Stability benefits may derive more from the exact projection mechanics than from the particular form of the advantage signal.

Load-bearing premise

That the common geometric structure of group-based methods is accurately described by restricting the proximal objective to the response simplex and that exact divergence minimization produces the claimed stability and performance properties.

What would settle it

A direct comparison of training curves in which the explicit projection step is removed while the same group-relative target is retained, checking whether monotonic improvement and gradient boundedness disappear.

Figures

Figures reproduced from arXiv: 2605.06139 by Clive Bai, Heming Zou, Kai Yang, Lizhou Cai, Qi Wang, Saiyong Yang, Weijie Liu, Wutong Xu, Xiangyang Ji, Yangkun Chen, Yingyue Li, Yixiu Mao, Yuhang Jiang, Yun Qu.

**Figure 1.** LPO iteratively ascends the reward landscape via explicit target projection, enabling stable optimization and flexible divergence design.

**Figure 2.** Illustration of LPO, which performs explicit target projection on the LLM response simplex, in contrast to …

**Figure 3.** Training curves of Pass@1 accuracy. Two LPO variants ( …

**Figure 4.** Pass@k training curves. LPO variants (LPOfwd, LPOrev) are evaluated against group-based PG baselines (GRPO, Dr.GRPO, MaxRL, shown from top to bottom) across various LLMs and tasks under paired temperature settings. Specific k configurations are detailed per benchmark.

**Figure 5.** Training dynamics of LPO variants and GRPO. Rows from top to bottom respectively show the curves of …

**Figure 6.** Ablation comparing listwise LPO with point …

**Figure 8.** Scalability validation. We compare LPO with GRPO by training Qwen3-14B-Base on the larger Polaris …

**Figure 9.** Training dynamics of LPO variants and Dr.GRPO. Rows from top to bottom respectively show the curves …

**Figure 10.** Training dynamics of LPO variants and MaxRL. Rows from top to bottom respectively show the curves …

**Figure 11.** Generalization of LPO across diverse LLM families. Performance is evaluated on Countdown using …

**Figure 12.** Empirical evaluation on the Countdown task under a fully on-policy regime (one gradient update per …

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
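
For readers unfamiliar with the group-relative advantage signals mentioned in the abstract, here is a minimal sketch. It assumes the commonly described GRPO-style form (reward minus group mean, divided by group standard deviation) and a variant without the standard-deviation division, as in Dr.GRPO-style estimators. The function name and the `eps` constant are illustrative choices.

```python
import statistics

def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Center rewards within a sampled group; optionally scale by the group std."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [c / (std + eps) for c in centered]

rewards = [1.0, 0.0, 0.0, 1.0]   # made-up 0/1 verifier outcomes for one prompt
adv_scaled = group_relative_advantages(rewards)
adv_centered = group_relative_advantages(rewards, normalize_std=False)
```

Both variants sum to zero within the group, which is the sense in which the signal is relative: a response is only pushed up at the expense of its siblings.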

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LPO gives an explicit geometric framing for group-based RLVR updates but the claimed monotonicity and stability rest on an equivalence that may not hold exactly under the simplex restriction.

The main takeaway is that this paper reinterprets common group-based policy gradients in RLVR as implicit target projections on the response simplex, then builds an explicit Listwise Policy Optimization (LPO) that performs the projection through exact divergence minimization rather than a first-order step. That unification and the decoupled projection step are the clearest new pieces. The experiments report consistent gains over standard baselines on reasoning tasks across several LLM backbones, while keeping response diversity and avoiding obvious instability. Those results are the practical hook. The geometric view also explains why some existing group-relative methods behave the way they do, which is useful for people already working in this area.

The soft spots sit in the central claim. The stress-test concern is fair: restricting the proximal objective to the simplex and switching to exact minimization may shift the effective target or the gradient structure compared with the original group-relative advantage signals, especially once advantage normalization and token-level parameterization are taken into account. Without the full derivations it is difficult to tell whether the monotonic improvement and bounded zero-sum properties actually survive that change or whether they are artifacts of the first-order approximation the paper is trying to replace. The abstract is also thin on proof details and hyperparameter sensitivity, so the stability story needs more scrutiny.

This is aimed at groups doing RL post-training for LLM reasoning. Readers who care about optimization geometry or alternatives to GRPO-style methods will get the most out of it. The idea is relevant enough and the empirical signal is positive enough that it deserves a serious referee, even if the theory section will probably need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript reinterprets group-based policy gradients in RLVR for LLMs as implicit first-order target projections onto the response simplex. It proposes Listwise Policy Optimization (LPO), which explicitly restricts the proximal RL objective to the simplex and performs exact divergence minimization to the target distribution derived from group-relative advantages. The authors claim this yields monotonic improvement on the listwise objective, bounded/zero-sum/self-correcting gradients, flexibility in divergence choice, and empirical gains over standard policy-gradient baselines on reasoning benchmarks while preserving stability and diversity.

Significance. If the geometric equivalence and transfer of stability guarantees hold, LPO would supply a principled, divergence-flexible framework that unifies and improves upon prevalent group-based RLVR methods. The explicit projection step could enable more stable post-training of reasoning LLMs and reduce reliance on ad-hoc advantage normalization.

major comments (2)

[§3.2] §3.2 and Eq. (8)–(11): the claimed exact recovery of group-relative policy gradients as the first-order approximation to simplex-restricted divergence minimization is not shown to hold when advantages are computed over token sequences rather than whole responses; the normalization across the sampled group appears to introduce an effective target shift that is not accounted for in the projection step.
[Theorem 1] Theorem 1 (monotonic improvement): the proof assumes the target distribution remains fixed during the exact projection, but the group-relative advantage used to define the target is itself recomputed from the current policy samples; this creates a moving-target issue that may invalidate the monotonicity guarantee unless an additional contraction argument is supplied.

minor comments (2)

[§2] Notation for the response simplex and the projection operator is introduced without an explicit definition of the ambient probability space over variable-length sequences.
[§5] Experimental section compares LPO only against matched-target baselines; an ablation varying the divergence (KL vs. reverse KL vs. Jensen-Shannon) would strengthen the claim of structural flexibility.
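
The three divergences named in the last minor comment can be written down for two points on the response simplex. This is a sketch only: the direction labelled "forward" here is KL(target‖policy) and "reverse" is KL(policy‖target), which is a common convention but an assumption about the paper's naming, and both inputs are assumed strictly positive.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(target, policy):
    return kl(target, policy)

def reverse_kl(target, policy):
    return kl(policy, target)

def jensen_shannon(target, policy):
    """Symmetric divergence built from KL to the midpoint distribution."""
    mid = [(t + p) / 2 for t, p in zip(target, policy)]
    return 0.5 * kl(target, mid) + 0.5 * kl(policy, mid)

target = [0.4, 0.1, 0.1, 0.4]   # made-up target on the simplex
policy = [0.1, 0.4, 0.3, 0.2]   # made-up current policy on the same group
values = {
    "forward_kl": forward_kl(target, policy),
    "reverse_kl": reverse_kl(target, policy),
    "jensen_shannon": jensen_shannon(target, policy),
}
```

The two KL directions generally give different values for the same pair, while Jensen-Shannon is symmetric and bounded by log 2, which is one concrete sense in which divergence choice carries distinct structural properties.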

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying the scope of our derivations and indicating the revisions we will make to improve rigor and transparency.

Referee: [§3.2] §3.2 and Eq. (8)–(11): the claimed exact recovery of group-relative policy gradients as the first-order approximation to simplex-restricted divergence minimization is not shown to hold when advantages are computed over token sequences rather than whole responses; the normalization across the sampled group appears to introduce an effective target shift that is not accounted for in the projection step.

Authors: We thank the referee for this observation. Our analysis in §3.2 and Eqs. (8)–(11) is developed for the standard response-level setting of group-based RLVR, where advantages are computed over complete responses and group normalization directly yields the target distribution on the response simplex. In this regime the first-order approximation recovers the group-relative policy gradient without additional shift. We acknowledge that token-sequence advantages would introduce a per-token normalization effect that shifts the effective target. We will revise the manuscript to explicitly state the response-level assumption, add a clarifying remark on the token-level case, and note that the geometric equivalence holds precisely under response-level advantages.
Revision: yes

Referee: [Theorem 1] Theorem 1 (monotonic improvement): the proof assumes the target distribution remains fixed during the exact projection, but the group-relative advantage used to define the target is itself recomputed from the current policy samples; this creates a moving-target issue that may invalidate the monotonicity guarantee unless an additional contraction argument is supplied.

Authors: We appreciate the referee’s careful examination of the proof. Theorem 1 establishes monotonic improvement on the listwise objective for a single exact projection step with the target held fixed. In the iterative algorithm the target is recomputed from the current policy’s samples, which indeed creates a moving-target dynamic. This is a standard consideration in iterative policy optimization. We will revise the theorem statement and surrounding discussion to clarify that the monotonicity guarantee applies conditionally to each projection step with fixed target, and we will add a remark acknowledging the iterative moving-target issue while noting that empirical results demonstrate stable improvement. A full contraction-mapping analysis of the overall iteration is left for future work.
Revision: partial
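
The fixed-target regime discussed in this exchange can be illustrated on a toy problem. The sketch below runs plain gradient descent on KL(target‖softmax(logits)) with the target held fixed and records the divergence at each step. The gradient loop is only a stand-in for the paper's exact divergence minimization, and the step size, iteration count, and numbers are illustrative assumptions; the moving-target case raised by the referee is not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(target, probs):
    """KL(target || probs); assumes both are strictly positive."""
    return sum(w * math.log(w / p) for w, p in zip(target, probs))

def project(logits, target, step=0.5, iters=50):
    """Gradient descent on the forward KL with a fixed target; returns logits and KL history."""
    z = list(logits)
    history = []
    for _ in range(iters):
        p = softmax(z)
        history.append(forward_kl(target, p))
        # Gradient of the forward KL with respect to the logits is p - target.
        z = [zi - step * (pi - wi) for zi, pi, wi in zip(z, p, target)]
    return z, history

target = [0.4, 0.1, 0.1, 0.4]               # made-up fixed target
final_logits, history = project([0.0, 0.0, 0.0, 0.0], target)
```

With the target frozen, the recorded divergence never increases for this step size; once the target is recomputed from fresh samples at every iteration, no such statement follows from this sketch.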

Circularity Check

1 step flagged

Reinterpretation of group-based gradients as implicit target-projection on simplex makes monotonicity and stability claims reduce to the same construction

specific steps

renaming known result [Abstract]
"This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, a"

The geometric structure is presented as a revelation about prior group-based methods, yet LPO is defined by making that same structure explicit. The monotonicity and stability properties are then derived from the projection geometry itself. Once the target is identified with the group-relative advantage distribution and the simplex restriction is imposed, the listed benefits are tautological consequences of the construction rather than new results independent of the original group-based updates.

full rationale

The paper's core derivation begins by asserting that existing group-based policy gradients implicitly define a target on the response simplex and approximate projection via first-order updates. LPO is then introduced by restricting the proximal objective to that simplex and replacing the approximation with exact divergence minimization. The listed guarantees (monotonic listwise improvement, bounded/zero-sum/self-correcting gradients) are obtained directly from the geometry of this explicit projection. Because the target and projection step are defined from the very group-relative advantages and sampling procedure of the baseline methods, the claimed advantages follow by construction once the equivalence is posited, rather than from an independent derivation or external benchmark. This matches the renaming-known-result pattern with load-bearing impact on the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that group-based policy gradients share an implicit target-projection geometry on the response simplex; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Group-based policy gradient methods implicitly define a target distribution on the response simplex and project toward it via first-order approximation.
This is the foundational revelation stated in the abstract on which the LPO proposal is built.

pith-pipeline@v0.9.0 · 5774 in / 1291 out tokens · 61144 ms · 2026-05-21T08:58:13.137733+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: $\max_{w \in \Delta^{K-1}} \hat{J}(w) = \sum_k w_k R_k - \tau D_{\mathrm{KL}}(w \,\|\, P_t)$ ... $w^*_k = \mathrm{softmax}(R_k/\tau + s_{t,k})$
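
The closed form in the quoted passage follows from a standard Lagrangian argument. The derivation below is a sketch under two assumptions that this page does not confirm: that $s_{t,k} = \log P_{t,k}$, and that sums run over the $K$ sampled responses.

```latex
\begin{aligned}
\mathcal{L}(w, \lambda)
  &= \sum_k w_k R_k \;-\; \tau \sum_k w_k \log\frac{w_k}{P_{t,k}}
     \;-\; \lambda\Big(\sum_k w_k - 1\Big), \\
\frac{\partial \mathcal{L}}{\partial w_k}
  &= R_k - \tau\Big(\log\frac{w_k}{P_{t,k}} + 1\Big) - \lambda = 0
  \;\;\Longrightarrow\;\;
  w_k = P_{t,k}\, \exp\!\Big(\frac{R_k}{\tau}\Big)\, \exp\!\Big(-1 - \frac{\lambda}{\tau}\Big), \\
w^*_k
  &= \frac{P_{t,k}\, \exp(R_k/\tau)}{\sum_j P_{t,j}\, \exp(R_j/\tau)}
   = \mathrm{softmax}_k\!\Big(\frac{R_k}{\tau} + \log P_{t,k}\Big).
\end{aligned}
```

The last line uses the constraint $\sum_k w_k = 1$ to eliminate $\lambda$. Because the objective is strictly concave in $w$ for $\tau > 0$, this stationary point is the unique maximizer on the simplex.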

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 23 internal anchors

[1]

Neural computation , volume=

Natural gradient works efficiently in learning , author=. Neural computation , volume=. 1998 , publisher=

work page 1998
[2]

Maximum a Posteriori Policy Optimisation

Maximum a posteriori policy optimisation , author=. arXiv preprint arXiv:1806.06920 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Encyclopedia of Machine Learning , pages=

Kullback-leibler divergence , author=. Encyclopedia of Machine Learning , pages=

work page
[4]

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent , author=. arXiv preprint arXiv:2605.02469 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[6]

Machine learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine learning , volume=. 1992 , publisher=

work page 1992
[7]

Journal of Machine Learning Research , volume=

Gflownet foundations , author=. Journal of Machine Learning Research , volume=

work page
[8]

Learning , volume=

From ranknet to lambdarank to lambdamart: An overview , author=. Learning , volume=

work page
[9]

Proceedings of the 24th international conference on Machine learning , pages=

Learning to rank: from pairwise approach to listwise approach , author=. Proceedings of the 24th international conference on Machine learning , pages=

work page
[10]

Neural Computation , volume=

Using expectation-maximization for reinforcement learning , author=. Neural Computation , volume=. 1997 , publisher=

work page 1997
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

arXiv e-prints , pages=

Reinforce++: A simple and efficient approach for aligning large language models , author=. arXiv e-prints , pages=

work page
[13]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Reinforcement learning and control as probabilistic inference: Tutorial and review , author=. arXiv preprint arXiv:1805.00909 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

arXiv preprint arXiv:2602.02710 , year=

Maximum Likelihood Reinforcement Learning , author=. arXiv preprint arXiv:2602.02710 , year=

work page arXiv
[16]

Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

2018 , publisher=

Lectures on convex optimization , author=. 2018 , publisher=

work page 2018
[18]

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models , author=. arXiv preprint arXiv:2602.01970 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

2025 , eprint=

Group Sequence Policy Optimization , author=. 2025 , eprint=

work page 2025
[20]

arXiv preprint arXiv:2502.18548 , year=

What is the Alignment Objective of GRPO? , author=. arXiv preprint arXiv:2502.18548 , year=

work page arXiv
[21]

Advances in neural information processing systems , volume=

A natural policy gradient , author=. Advances in neural information processing systems , volume=

work page
[22]

arXiv preprint arXiv:1909.12238 , year=

V-mpo: On-policy maximum a posteriori policy optimization for discrete and continuous control , author=. arXiv preprint arXiv:1909.12238 , year=

work page arXiv 1909
[23]

Tomar, L

Mirror descent policy optimization , author=. arXiv preprint arXiv:2005.09814 , year=

work page arXiv 2005
[24]

International conference on machine learning , pages=

A theory of regularized markov decision processes , author=. International conference on machine learning , pages=. 2019 , organization=

work page 2019
[25]

Reinforcement learning with verifiable rewards: Grpo’s effective loss, dy- namics, and success amplification.arXiv preprint arXiv:2503.06639, 2025

Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification , author=. arXiv preprint arXiv:2503.06639 , year=

work page arXiv
[26]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Machine learning , volume=

Learning to predict by the methods of temporal differences , author=. Machine learning , volume=. 1988 , publisher=

work page 1988
[28]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring Mathematical Problem Solving With the MATH Dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Qwen2.5-VL Technical Report

Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning , author =. The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021) , year =

work page 2021
[31]

Process Reinforcement through Implicit Rewards

Process reinforcement through implicit rewards , author=. arXiv preprint arXiv:2502.01456 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. arXiv preprint arXiv:2402.14008 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Advances in Neural Information Processing Systems , volume=

Solving quantitative reasoning problems with language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[34]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=

work page
[35]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

arXiv preprint arXiv:2510.01135 , year=

Prompt curriculum learning for efficient llm post-training , author=. arXiv preprint arXiv:2510.01135 , year=

work page arXiv
[37]

2017 , eprint=

Trust Region Policy Optimization , author=. 2017 , eprint=

work page 2017
[38]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Search-r1: Training llms to reason and leverage search engines with reinforcement learning , author=. arXiv preprint arXiv:2503.09516 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

2026 , eprint=

Target Policy Optimization , author=. 2026 , eprint=

work page 2026
[40]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

An, Chenxin and Xie, Zhihui and Li, Xiaonan and Li, Lei and Zhang, Jun and Gong, Shansan and Zhong, Ming and Xu, Jingjing and Qiu, Xipeng and Wang, Mingxuan and Kong, Lingpeng , year =. POLARIS: A Post-Training Recipe for Scaling Reinforcement Learning on Advanced Reasoning Models , url =

work page
[42]

Jiayi Pan and Junjie Zhang and Xingyao Wang and Lifan Yuan and Hao Peng and Alane Suhr , title =

work page
[43]

arXiv preprint arXiv:2603.10887 , year=

Dynamics-predictive sampling for active RL finetuning of large reasoning models , author=. arXiv preprint arXiv:2603.10887 , year=

work page arXiv
[44]

Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Bj¨orn Ommer, and Xiangyang Ji

Can prompt difficulty be online predicted for accelerating rl finetuning of reasoning models? , author=. arXiv preprint arXiv:2507.04632 , year=

work page arXiv
[45]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Advantage-weighted regression: Simple and scalable off-policy reinforcement learning , author=. arXiv preprint arXiv:1910.00177 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1910
[46]

arXiv preprint arXiv:2310.10505 , year=

Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models , author=. arXiv preprint arXiv:2310.10505 , year=

work page arXiv
[47]

Notion Blog , year=

Deepcoder: A fully open-source 14b coder at o3-mini level , author=. Notion Blog , year=

work page
[48]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Relative entropy policy search , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[49]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

The American Statistician , volume=

A tutorial on MM algorithms , author=. The American Statistician , volume=. 2004 , publisher=

work page 2004
[52]

Proceedings of the Royal Society of London

An invariant form for the prior probability in estimation problems , author=. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences , volume=. 1946 , publisher=

work page 1946
[53]

Learning in graphical models , pages=

A view of the EM algorithm that justifies incremental, sparse, and other variants , author=. Learning in graphical models , pages=. 1998 , publisher=

work page 1998
[54]

International conference on machine learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International conference on machine learning , pages=. 2018 , organization=

work page 2018
[55]

IEEE Transactions on Information theory , volume=

Divergence measures based on the Shannon entropy , author=. IEEE Transactions on Information theory , volume=. 2002 , publisher=

work page 2002
[56]

1959 , publisher=

Individual choice behavior , author=. 1959 , publisher=

work page 1959
[57]

Maestro: Learning to collaborate via conditional listwise policy optimization for multi-agent llms.arXiv preprint arXiv:2511.06134, 2025a

Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs , author=. arXiv preprint arXiv:2511.06134 , year=

work page arXiv
[58]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

work page
[59] GPQA: A graduate-level Google-proof Q&A benchmark. First Conference on Language Modeling.

[60] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

[61] The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 1975.

[62] HybridFlow: A Flexible and Efficient RLHF Framework. 2024.

[63] It Takes Two: Your GRPO Is Secretly DPO. arXiv preprint arXiv:2510.00977.

[64] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.

[65] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585.

[66] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2024.

[67] Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.

[68] FlowRL: Matching reward distributions for LLM reasoning. arXiv preprint arXiv:2509.15207.

[69] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.

[70] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.

[71] LiPO: Listwise Preference Optimization through Learning-to-Rank. 2025.

[72] Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.

[73] Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. arXiv preprint arXiv:2309.16240, 2023.

[74] The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. arXiv preprint arXiv:2509.07430.

[75] Reverse-KL Reinforcement Learning Can Sample From Multiple Diverse Modes.
