LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Bowen Deng; Jinghan Li; Liang Zhao; Xinyuan Chen; Yipeng Zhou; Zhe Yuan; Zhiqian Chen

arxiv: 2605.21235 · v1 · pith:GPCFRW5Znew · submitted 2026-05-20 · 💻 cs.CL

Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao

Pith reviewed 2026-05-21 05:08 UTC · model grok-4.3

keywords: policy optimization · reinforcement learning · reasoning language models · pairwise advantage · RLVR · credit assignment · sparse rewards · GRPO

The pith

LamPO replaces scalar group advantages with pairwise decomposed advantages to improve credit assignment under sparse rewards in reasoning model RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LamPO as a lambda-style policy optimization method for training reasoning language models with verifiable rewards. It identifies that group-relative methods like GRPO lose fine-grained relational information by summarizing candidate responses with scalar statistics, weakening credit assignment when solutions differ subtly. LamPO instead aggregates pairwise reward gaps within each group and modulates each comparison by a weight derived from log-probability differences, while preserving the critic-free clipped PPO structure. An optional ROUGE-L auxiliary reward is added when references are available to further densify the signal. Experiments across multiple model sizes and benchmarks show gains in performance, training stability, and sample efficiency over GRPO and other RLVR variants.
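For contrast with what follows, here is a minimal sketch of the scalar group-relative advantage that GRPO-style methods compute. It assumes the common mean-and-standard-deviation normalization; the function name `grpo_advantages`, the use of the population standard deviation, and the `eps` value are illustrative choices, not taken from the paper.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Scalar group-relative advantage: (r_i - mean) / (std + eps).

    Assumption: population standard deviation and a small eps for
    numerical safety; implementations differ on both details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary outcome rewards for a group of four sampled responses.
rewards = [1.0, 1.0, 0.0, 0.0]
advs = grpo_advantages(rewards)

# Both correct responses receive exactly the same advantage, because the
# group is summarized only by its mean and standard deviation.
same_for_correct = advs[0] == advs[1]
```

Under a binary verifier, every response with the same reward gets the same advantage, which is the loss of relational information the summary describes.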

Core claim

LamPO defines a Pairwise Decomposed Advantage that aggregates pairwise reward gaps within response groups, each modulated by a confidence-aware weight computed from sequence log-probability differences, thereby replacing scalar group advantages to deliver finer credit assignment in critic-free policy optimization for reasoning tasks.
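The core claim can be sketched in a few lines, following the formula quoted further down this page, A_λ(o_i) = 1/(G−1) Σ (R(o_i) − R(o_j)) σ(Δs_ij/τ). This is a hedged illustration, not the authors' implementation: the sign convention Δs_ij = s_i − s_j, the default `tau=1.0`, and all function names are assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_decomposed_advantage(rewards, seq_logps, tau=1.0):
    """Average of pairwise reward gaps, each weighted by a sigmoid of the
    sequence log-probability difference.

    Assumptions: delta_s = seq_logps[i] - seq_logps[j], and tau is a
    temperature whose value the source page does not state.
    """
    group_size = len(rewards)
    if group_size < 2:
        raise ValueError("need at least two responses per group")
    advantages = []
    for i in range(group_size):
        total = 0.0
        for j in range(group_size):
            if j == i:
                continue
            gap = rewards[i] - rewards[j]
            delta_s = seq_logps[i] - seq_logps[j]  # assumed sign convention
            total += gap * sigmoid(delta_s / tau)
        advantages.append(total / (group_size - 1))
    return advantages


rewards = [1.0, 1.0, 0.0, 0.0]
seq_logps = [-12.0, -20.0, -15.0, -30.0]  # made-up sequence log-probabilities
pda = pairwise_decomposed_advantage(rewards, seq_logps)

# With a very large temperature every weight tends to 0.5, and the result
# collapses to a rescaled mean-centered reward: 0.5 * G/(G-1) * (R_i - mean).
flat = pairwise_decomposed_advantage(rewards, seq_logps, tau=1e9)
```

Unlike a scalar group statistic, the two correct responses here receive different advantages because their log-probabilities differ. The large-temperature limit shows that, under these assumptions, the weighting term is what separates this quantity from a mean-centered baseline.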

What carries the argument

Pairwise Decomposed Advantage, which computes and aggregates modulated pairwise reward gaps to extract relational credit signals from candidate responses.

If this is right

LamPO yields higher scores than GRPO on AIME24, AIME25, MATH-500, and GPQA-Diamond across Qwen3 and Phi-4 models.
Training curves exhibit greater stability and reduced variance with LamPO.
LamPO reaches target performance with fewer training samples than prior RLVR methods.
The optional ROUGE-L auxiliary reward further mitigates sparsity when reference solutions exist.
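To make the ROUGE-L auxiliary reward concrete, the sketch below computes a longest-common-subsequence F-score between a candidate and a reference and adds it to the outcome reward. Whitespace tokenization, the balanced F1 form, the additive combination, and the weight `alpha` are all assumptions made for illustration; the page does not say how the paper combines the two signals.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as a balanced F-score over whitespace tokens (assumed form)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def shaped_reward(outcome_reward, candidate, reference, alpha=0.1):
    """Hypothetical combination: sparse outcome reward plus a small dense bonus."""
    return outcome_reward + alpha * rouge_l_f1(candidate, reference)


score = rouge_l_f1("a b c d", "a c d e")  # LCS is "a c d"
```

A small `alpha` keeps the verifiable outcome reward dominant while still distinguishing responses that would otherwise tie at zero.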

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The pairwise modulation mechanism may generalize to other sparse-reward RL domains where candidate actions differ by small quality margins.
By extracting more signal from each sampled group, LamPO could allow smaller group sizes without sacrificing credit assignment quality.
The approach suggests a route to hybridize outcome-based RL with lightweight ranking signals for language model post-training.

Load-bearing premise

That pairwise reward gaps modulated by log-probability differences yield meaningfully better credit assignment than scalar group statistics when candidate solutions differ only subtly under sparse outcome rewards.

What would settle it

Ablation experiments on AIME24 or MATH-500 with the same models and group sizes showing that LamPO produces no measurable gains in accuracy or stability compared to GRPO under identical training budgets.

Figures

Figures reproduced from arXiv: 2605.21235 by Bowen Deng, Jinghan Li, Liang Zhao, Xinyuan Chen, Yipeng Zhou, Zhe Yuan, Zhiqian Chen.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose LamPO, a Lambda-Style Policy Optimization method that replaces scalar group advantages with a Pairwise Decomposed Advantage. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LamPO swaps scalar advantages for pairwise ones in GRPO-style RLVR, but the gains could easily trace to the added ROUGE-L term rather than the new credit assignment.

LamPO replaces the scalar group advantage in GRPO with a pairwise decomposed version that aggregates reward gaps and weights them by log-probability differences. This targets the loss of fine-grained relational signals when sampled responses are similar under sparse outcome rewards. The method stays critic-free and keeps the clipped PPO update, which keeps implementation straightforward for existing RLVR pipelines. They also add an optional ROUGE-L dense auxiliary when references exist to ease sparsity on math and QA tasks.

Experiments with Qwen3 and Phi-4-mini models on AIME24, AIME25, MATH-500, and GPQA-Diamond report steadier training and better sample efficiency than GRPO and other recent variants. The core idea is a direct extension of group-relative objectives, and the motivation for using pairwise comparisons is clearly stated. The citation pattern follows the standard RLVR and GRPO references without obvious gaps.

The main soft spot is the lack of isolation for the pairwise mechanism. Because the ROUGE-L term is added in some settings, the comparisons do not cleanly show whether the reported lifts come from the decomposed advantage or from the extra dense signal. Without that separation the load-bearing claim about improved credit assignment stays partly untested.

This work is for groups already running RLVR on reasoning models who want practical tweaks to advantage estimation. A reader who knows GRPO would see the change quickly and could try it on their own setups. It deserves peer review because the problem is real, the method is motivated, and the benchmarks are relevant even if tighter controls on the auxiliary reward would strengthen the case.

Referee Report

1 major / 1 minor

Summary. The paper introduces LamPO, a Lambda-Style Policy Optimization method for reasoning language models under reinforcement learning with verifiable rewards (RLVR). It replaces the scalar group statistics of GRPO with a Pairwise Decomposed Advantage that aggregates pairwise reward gaps modulated by confidence-aware weights derived from sequence log-probability differences, while preserving a critic-free clipped PPO-style update. When references are available, a lightweight ROUGE-L dense auxiliary reward is added to address sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini models report consistent improvements over GRPO and recent RLVR variants, along with more stable training dynamics and better sample efficiency.

Significance. If the gains can be attributed to the pairwise relational credit assignment rather than auxiliary signals, the approach could meaningfully advance RLVR by exploiting fine-grained differences among candidate solutions under sparse outcome rewards, potentially improving stability and efficiency in training reasoning models without introducing a critic network.

major comments (1)

Experiments section: The central claim that aggregating pairwise reward gaps modulated by log-probability differences yields better credit assignment than GRPO scalar statistics under sparse rewards is not isolated from the optional ROUGE-L auxiliary reward. The reported comparisons to GRPO and RLVR variants do not include an ablation that applies the pairwise decomposed advantage without the ROUGE-L component (or vice versa), leaving open the possibility that performance lifts on AIME24, AIME25, MATH-500, and GPQA-Diamond stem from the added dense signal rather than the proposed relational mechanism.
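The missing control can be stated as a 2×2 design that crosses the advantage estimator with the auxiliary reward. The sketch below only enumerates the four cells; the labels `grpo_scalar`, `pairwise_decomposed`, `none`, and `rouge_l` are hypothetical names for illustration, not identifiers from the paper or any training framework.

```python
from itertools import product

# Hypothetical labels for the two factors under study.
ADVANTAGES = ("grpo_scalar", "pairwise_decomposed")
AUX_REWARDS = ("none", "rouge_l")


def ablation_grid():
    """Enumerate every combination of advantage estimator and auxiliary reward."""
    return [
        {"advantage": adv, "aux_reward": aux}
        for adv, aux in product(ADVANTAGES, AUX_REWARDS)
    ]


grid = ablation_grid()
```

The cell with `pairwise_decomposed` and `none` is the one that isolates the relational mechanism, and the cell with `grpo_scalar` and `rouge_l` shows how much the dense signal contributes on its own.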

minor comments (1)

Abstract and method description: The motivation for the 'Lambda-Style' nomenclature is not elaborated; a brief clarification of its connection to lambda-style weighting or related concepts would improve accessibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comment and provide a point-by-point response below, including planned revisions to strengthen the paper.

Referee: Experiments section: The central claim that aggregating pairwise reward gaps modulated by log-probability differences yields better credit assignment than GRPO scalar statistics under sparse rewards is not isolated from the optional ROUGE-L auxiliary reward. The reported comparisons to GRPO and RLVR variants do not include an ablation that applies the pairwise decomposed advantage without the ROUGE-L component (or vice versa), leaving open the possibility that performance lifts on AIME24, AIME25, MATH-500, and GPQA-Diamond stem from the added dense signal rather than the proposed relational mechanism.

Authors: We agree that isolating the contribution of the Pairwise Decomposed Advantage from the optional ROUGE-L auxiliary reward is important for substantiating the central claim. The manuscript presents the ROUGE-L component as lightweight and conditional on the availability of reference solutions, primarily to address reward sparsity in certain settings. However, the current experimental comparisons do not explicitly ablate its presence. To address this, we will add a dedicated ablation study in the revised manuscript. Specifically, we will report results for the core LamPO formulation (pairwise decomposed advantage with confidence-aware weighting, without ROUGE-L) on AIME24, AIME25, MATH-500, and GPQA-Diamond, directly compared to GRPO and other RLVR baselines. These additional results will clarify that the observed gains in performance, stability, and sample efficiency arise from the relational credit assignment mechanism rather than the auxiliary dense signal. We believe this revision will resolve the concern and improve the clarity of the experimental section.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; LamPO is a direct definitional replacement for GRPO

full rationale

The paper defines LamPO explicitly as a new objective that substitutes scalar group advantages with a pairwise decomposed advantage (aggregating reward gaps modulated by log-probability differences) while retaining PPO-style clipping. This is a constructive proposal of a modified loss, not a derivation that reduces to its own fitted inputs or prior self-citations. No equations are shown to equal their inputs by construction, no uniqueness theorems are invoked from overlapping authors, and the optional ROUGE-L auxiliary is presented as an add-on rather than a load-bearing premise. Experimental improvements are claimed via direct comparison on benchmarks, leaving the method self-contained against external evaluation.
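Because the proposal changes only the advantage and keeps PPO-style clipping, the update it plugs into can be illustrated with the standard clipped surrogate from the PPO paper. This is a sketch for a single importance ratio; whether LamPO applies the ratio per token or per sequence, and the value of `clip_eps`, are assumptions not confirmed by this page.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective term: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the importance ratio exp(logp_new - logp_old). The advantage can
    come from any estimator, scalar or pairwise.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A positive advantage stops earning extra objective once the ratio exceeds 1 + eps.
capped = clipped_surrogate(logp_new=-1.0, logp_old=-1.5, advantage=1.0)

# A negative advantage is not capped on that side, so the penalty still grows.
uncapped = clipped_surrogate(logp_new=-1.0, logp_old=-1.5, advantage=-1.0)
```

Swapping the advantage estimator leaves this function untouched, which is why the method is described as a replacement of the advantage rather than a new optimizer.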

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the untested premise that pairwise reward gaps carry useful fine-grained information beyond scalar summaries; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: LamPO replaces scalar group advantages with a Pairwise Decomposed Advantage (PDA) ... $A_\lambda(o_i) = \frac{1}{G-1} \sum_{j \neq i} \big(R(o_i) - R(o_j)\big)\, \sigma(\Delta s_{ij} / \tau)$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose LamPO, a Lambda-Style Policy Optimization method

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 14 internal anchors

[1] A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen, et al. Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs. arXiv preprint arXiv:2503.01743, 2025.
[2] G. Cui, L. Yuan, Z. Wang, H. Wang, Y. Zhang, J. Chen, W. Li, B. He, Y. Fan, T. Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
[3] C. Gao, C. Zheng, X.-H. Chen, K. Dang, S. Liu, B. Yu, A. Yang, S. Bai, J. Zhou, and J. Lin. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347, 2025.
[4] J. Gehring, K. Zheng, J. Copet, V. Mella, Q. Carbonneaux, T. Cohen, and G. Synnaeve. RLEF: Grounding code LLMs in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024.
[5] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[6] J. Hu. REINFORCE++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025.
[7] J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H.-Y. Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.
[8] A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
[9] A. Kazemnejad, M. Aghajohari, E. Portelance, A. Sordoni, S. Reddy, A. Courville, and N. L. Roux. VinePPO: Refining credit assignment in RL training of LLMs. arXiv preprint arXiv:2410.01679, 2024.
[10] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[11] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[12] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
[13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[14] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[15] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. hsin Chi, F. Xia, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[16] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[17] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[18] T. Yu, B. Ji, S. Wang, S. Yao, Z. Wang, G. Cui, L. Yuan, N. Ding, Y. Yao, Z. Liu, et al. RLPR: Extrapolating RLVR to general domains without verifiers. arXiv preprint arXiv:2506.18254, 2025.
[19] W. Zeng, Y. Huang, Q. Liu, W. Liu, K. He, Z. Ma, and J. He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025.
[20] Y. Zhang and T. Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024.
[21] Y. Zhang and T. Math-AI. American Invitational Mathematics Examination (AIME) 2025, 2025.
[22] C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
