
arxiv: 2605.20865 · v1 · pith:W6PEMK22new · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

Pith reviewed 2026-05-21 05:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · policy optimization · PPO · large language models · verifiable rewards · likelihood ratio · policy gradient · reasoning

The pith

The N-step forward trace augments PPO to achieve tighter policy-improvement bounds in reinforcement learning with verifiable rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that augmenting the standard PPO surrogate with the cumulative likelihood ratio over multiple future tokens produces a more accurate objective for policy improvement. This is achieved through the N-step forward trace integrated into NFPO, which operates within the masked policy gradient setup for language models. If correct, this approach reduces the bias from local approximations while controlling variance, leading to better training of reasoning capabilities in LLMs using verifiable rewards. Theory supports a tighter bound for appropriate N, and experiments back this with gains on benchmarks.

Core claim

We introduce the N-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next N-1 tokens. We propose NFPO, a practical RLVR algorithm that integrates the N-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective. With an appropriate choice of N, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance.
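One way to see the claimed bridge is a short telescoping argument. The block below is a sketch under stated assumptions, not the paper's own derivation: it assumes the forward trace is the product of the next $N-1$ token ratios, truncated at the sequence end $T$, and that the surrogate weights each local term $(\rho_t - 1)$ by that trace. The names $L^{(N)}_{\mu}$ and $J$ are notation introduced here for illustration, and clipping, baselines, and token masks are ignored.

```latex
% Assumed definitions (illustrative notation)
\rho_t := \frac{\pi(y_t \mid s_t)}{\mu(y_t \mid s_t)}, \qquad
\Gamma^{(N)}_{t+1} := \prod_{k=t+1}^{\min(t+N-1,\,T)} \rho_k
\quad \text{(empty product} = 1\text{)}, \qquad
J(\pi) := \mathbb{E}_{y \sim \pi}\big[ R(y) \big]

L^{(N)}_{\mu}(\pi) := \mathbb{E}_{y \sim \mu}\Big[ R(y) \sum_{t=1}^{T} (\rho_t - 1)\, \Gamma^{(N)}_{t+1} \Big]

% N = 1: the trace is an empty product, so only local ratios remain
L^{(1)}_{\mu}(\pi) = \mathbb{E}_{y \sim \mu}\Big[ R(y) \sum_{t=1}^{T} (\rho_t - 1) \Big]

% N = T: the sum telescopes to the full-sequence importance weight
\sum_{t=1}^{T} (\rho_t - 1) \prod_{k=t+1}^{T} \rho_k
  = \sum_{t=1}^{T} \Big( \prod_{k=t}^{T} \rho_k - \prod_{k=t+1}^{T} \rho_k \Big)
  = \prod_{k=1}^{T} \rho_k - 1

L^{(T)}_{\mu}(\pi) = \mathbb{E}_{y \sim \mu}\Big[ R(y) \Big( \prod_{k=1}^{T} \rho_k - 1 \Big) \Big]
  = J(\pi) - J(\mu)
```

The last equality uses the standard importance-sampling identity and requires $\mu$ to put positive probability on every sequence that $\pi$ can generate. Intermediate $N$ keeps only part of the telescoping sum, which is where the bias–variance trade-off enters.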

What carries the argument

The N-step forward trace that augments the PPO surrogate objective by the cumulative likelihood ratio of the next N-1 tokens.
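A minimal sketch of how such a trace could be computed for one sampled sequence, assuming the trace at position t is the product of the ratios of the next N-1 tokens, truncated at the end of the sequence. The function name `forward_trace` and its arguments are hypothetical, and the paper's implementation may differ (for example in normalization, clipping, or gradient handling).

```python
import numpy as np

def forward_trace(log_ratios, n):
    """Product of the next n-1 token ratios at each position (illustrative).

    log_ratios[t] is log pi(y_t | s_t) - log mu(y_t | s_t) for one sequence.
    Returns trace with trace[t] = prod_{k=t+1}^{min(t+n-1, T-1)} exp(log_ratios[k]),
    using 0-based indices; an empty product equals 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    log_ratios = np.asarray(log_ratios, dtype=float)
    T = log_ratios.shape[0]
    # prefix[j] = sum of log_ratios[:j], so a window sum is a difference of prefixes.
    prefix = np.concatenate(([0.0], np.cumsum(log_ratios)))
    start = np.arange(T) + 1                # first future index
    stop = np.minimum(np.arange(T) + n, T)  # exclusive end of the window
    # Work in log space so long products do not overflow before exponentiation.
    return np.exp(prefix[stop] - prefix[start])

ratios = np.array([2.0, 0.5, 4.0, 0.25])
for n in (1, 2, 4):
    print(n, forward_trace(np.log(ratios), n))
```

With `n = 1` every window is empty and the trace is all ones, which matches the local PPO/GRPO-style case; with `n` equal to the sequence length each position sees the full product of the remaining ratios.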

Load-bearing premise

That the cumulative likelihood ratio over the next N-1 tokens can be stably incorporated into the existing masked policy gradient framework without introducing new sources of instability or requiring additional variance-reduction techniques.

What would settle it

The claim would fail if the theoretical analysis shows no tighter bound for any N, or if empirical results on reasoning benchmarks show no improvement over standard PPO.

Figures

Figures reproduced from arXiv: 2605.20865 by Byeongchan Kim, Deokgyu Yoon, Gyungin Shin, Hyungkyu Kang, Joongkyu Lee, Min-hwan Oh, Sungrae Park.

Figure 1
Figure 1: NFPO corrects the bias of local PPO/GRPO-style objectives by reweighting each token likelihood ratio $\rho_t := \pi(y_t \mid s_t) / \mu(y_t \mid s_t)$ with a forward trace $\Gamma^{(N)}_{t+1}$. (Right) The trace horizon $N$ controls the bias–variance trade-off, interpolating between PPO/GRPO ($N = 1$) and the exact policy gradient objective ($N = T$).
Figure 2
Figure 2: Bias–variance trade-off of the $N$-step surrogate objective on the token MDP example.
However, at the same time, the variance $\mathbb{V}[\hat{L}^{(N)}_{\mu}(\pi)]$ increases with $N$, reflecting the growing instability introduced by longer products of importance ratios. Overall, this example provides a clean and controlled validation of the theory: small $N$ yields a stable but biased objective, while large $N$ reduces bias at the cos…
Figure 3
Figure 3: (Left) Comparison of token-level ($\rho_t$) and trace-corrected ($\rho_t \cdot \bar{\Gamma}^{(N,i)}_{t+1}$) likelihood ratios along a 50-token rollout segment. (Right) Batch-averaged switch frequency, halved by the forward trace.
$\mathbb{E}_t\big[\,|\rho_t \cdot \bar{\Gamma}^{(N,i)}_{t+1} - 1|\,\big]$ quantifies the correction strength of NFPO. We compute correction strength for approximately 3.4M tokens collected during training, and categorize the tokens semantically to compare…
Figure 4
Figure 4: Ablation results across $N$, where $N = 1$ corresponds to DPPO. (Left) Variance of forward trace. (Middle) Switch frequency of forward trace. (Right) Performance on the benchmarks.
Figure 5
Figure 5: Ablation experiments on NFPO: (Left) Different clip ranges, (Middle) Alternative design choices, and (Right) Different token masks.
… trace and the strength of the low-pass filtering are optimally balanced. This interplay naturally leads to the emergence of a performance sweet spot. 6 Ablation Study. We present a comprehensive ablation study on the hyperparameters and design choices of NFPO. Ablation 1. Forwa…
Original abstract

Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.
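The bias–variance trade-off described in the abstract can be explored on a toy problem by exact enumeration. The sketch below is an illustration built on assumptions, not the paper's token MDP example: it uses binary tokens, position-wise Bernoulli policies (`p_mu`, `p_pi`), a hypothetical verifiable reward, and a per-sequence surrogate term $R(y) \sum_t (\rho_t - 1)\,\Gamma^{(N)}_{t+1}$ without clipping or baselines. All names are invented for the example.

```python
import itertools
import numpy as np

def forward_trace(ratios, n):
    # trace[t] = product of the next n-1 ratios, truncated at the sequence end.
    T = len(ratios)
    return np.array([np.prod(ratios[t + 1 : min(t + n, T)]) for t in range(T)])

def token_probs(y, p):
    # Probability of each observed binary token under a position-wise Bernoulli policy.
    return np.where(y == 1, p, 1.0 - p)

def surrogate_moments(p_mu, p_pi, reward, n):
    """Exact mean and variance, under mu, of R(y) * sum_t (rho_t - 1) * trace_t."""
    mean = second = 0.0
    for bits in itertools.product([0, 1], repeat=len(p_mu)):
        y = np.array(bits)
        mu_tok, pi_tok = token_probs(y, p_mu), token_probs(y, p_pi)
        ratios = pi_tok / mu_tok
        x = reward(y) * np.sum((ratios - 1.0) * forward_trace(ratios, n))
        prob = np.prod(mu_tok)
        mean += prob * x
        second += prob * x * x
    return mean, second - mean ** 2

def true_gap(p_mu, p_pi, reward):
    """Exact J(pi) - J(mu) by enumerating all sequences."""
    gap = 0.0
    for bits in itertools.product([0, 1], repeat=len(p_mu)):
        y = np.array(bits)
        gap += (np.prod(token_probs(y, p_pi)) - np.prod(token_probs(y, p_mu))) * reward(y)
    return gap

def reward(y):
    # Hypothetical verifiable reward: 1 if at least three tokens equal 1.
    return float(y.sum() >= 3)

p_mu = np.array([0.5, 0.5, 0.5, 0.5])   # behaviour policy (assumed)
p_pi = np.array([0.6, 0.7, 0.4, 0.8])   # target policy (assumed)

gap = true_gap(p_mu, p_pi, reward)
for n in range(1, len(p_mu) + 1):
    mean, var = surrogate_moments(p_mu, p_pi, reward, n)
    print(f"N={n}  bias={mean - gap:+.4f}  variance={var:.4f}")
```

In this setup the bias vanishes at `N = T` by the telescoping identity; how bias and variance behave at intermediate `N` depends on the chosen policies and reward, so the printed numbers are only a probe of one configuration.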

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces N-Step Forward-Trace Policy Optimization (NFPO) for RL with verifiable rewards in LLMs. It augments the PPO surrogate with an N-step forward trace (cumulative likelihood-ratio product over the next N-1 tokens) and integrates it into the masked policy-gradient framework for token-level updates. The central theoretical claim is that an appropriate N yields a strictly tighter policy-improvement bound than the standard one-step PPO surrogate; experiments on reasoning benchmarks are reported to support the claim.

Significance. If the bound derivation holds after proper mask handling, the method supplies a controllable bias-variance bridge between the PPO surrogate and the exact policy gradient, which is a useful conceptual advance for RLVR. The explicit integration into masked token-level gradients and the reported benchmark gains are concrete strengths.

major comments (1)
  1. [§4] §4 (Theorem 1 and surrounding derivation): the proof that the N-step objective produces a strictly tighter improvement bound treats the cumulative likelihood-ratio product as if formed over a fully observed trajectory. In the masked policy-gradient setting used for token-level LLM updates, only a subset of positions contribute; the manuscript supplies no equation showing that the mask indicator is propagated inside the expectation, so the claimed reduction in approximation error does not necessarily follow.
minor comments (2)
  1. [Abstract] The abstract and §5 refer to 'comprehensive reasoning benchmarks' without listing the exact datasets or splits used; this should be stated explicitly.
  2. [§5] Table 2 and Figure 3 lack error bars or standard deviations across seeds; adding them would strengthen the empirical support for the theoretical prediction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The primary concern raised pertains to the explicit handling of masks within the theoretical derivation of Theorem 1. We address this point below and will revise the manuscript to incorporate the requested clarification.

Point-by-point responses
  1. Referee: [§4] §4 (Theorem 1 and surrounding derivation): the proof that the N-step objective produces a strictly tighter improvement bound treats the cumulative likelihood-ratio product as if formed over a fully observed trajectory. In the masked policy-gradient setting used for token-level LLM updates, only a subset of positions contribute; the manuscript supplies no equation showing that the mask indicator is propagated inside the expectation, so the claimed reduction in approximation error does not necessarily follow.

    Authors: We agree that the current write-up of the proof would be strengthened by an explicit equation showing propagation of the mask indicator inside the expectation. In the revised version we will insert the masked N-step likelihood-ratio product, defined as the product of per-token ratios multiplied by the indicator that the position contributes to the gradient. We will then re-derive the policy-improvement bound under this masked expectation and show that the N-step surrogate remains strictly tighter than the one-step PPO surrogate for appropriate N. This addition directly addresses the concern while preserving the original claim. revision: yes
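The masked product described in the response can be made concrete with a small sketch. It assumes the trace-corrected ratio at position t is the per-token ratio times the product of the next N-1 ratios, multiplied by a 0/1 indicator for whether position t contributes to the gradient. The function name `masked_trace_ratio` and the `mask_inside_product` flag are hypothetical; the source does not say whether masked-out future tokens should also be dropped from inside the product, so both variants are shown.

```python
import numpy as np

def masked_trace_ratio(log_ratios, mask, n, mask_inside_product=False):
    """Return mask[t] * rho_t * (product of the next n-1 ratios), per position.

    log_ratios[t] = log pi(y_t | s_t) - log mu(y_t | s_t); mask[t] is 0 or 1.
    If mask_inside_product is True, masked-out future tokens are treated as
    ratio 1 inside the product (an assumption made for illustration only).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    log_ratios = np.asarray(log_ratios, dtype=float)
    mask = np.asarray(mask, dtype=float)
    T = log_ratios.shape[0]
    future = log_ratios * mask if mask_inside_product else log_ratios
    # Window sums of future log-ratios via prefix-sum differences.
    prefix = np.concatenate(([0.0], np.cumsum(future)))
    start = np.arange(T) + 1
    stop = np.minimum(np.arange(T) + n, T)
    trace = np.exp(prefix[stop] - prefix[start])
    # The outer indicator zeroes positions that do not contribute to the gradient.
    return mask * np.exp(log_ratios) * trace

log_r = np.log(np.array([2.0, 0.5, 4.0, 0.25]))
mask = np.array([1, 0, 1, 1])
print(masked_trace_ratio(log_r, mask, n=2))
print(masked_trace_ratio(log_r, mask, n=2, mask_inside_product=True))
```

The two variants differ only at positions whose window contains a masked-out token, which is exactly the case the referee asks the revised proof to handle explicitly.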

Circularity Check

0 steps flagged

No significant circularity detected in the bound derivation

full rationale

The paper derives a tighter policy-improvement bound for the N-step forward-trace objective relative to the standard PPO surrogate via theoretical analysis. This is a conventional bias-variance argument in which larger N reduces the structural bias of the local surrogate toward the exact policy gradient; the bound is not shown to be tautological or obtained by fitting a parameter that is then relabeled as a prediction. No self-definitional steps, fitted inputs presented as predictions, or load-bearing self-citations appear in the provided material. The derivation remains independent of its own outputs and is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The work rests on standard policy-gradient and importance-sampling assumptions plus the new N-step forward trace construct; N itself functions as a tunable parameter whose selection affects the bound tightness.

free parameters (1)
  • N
    Step count chosen to achieve the tighter bound; its value is not derived from first principles in the abstract.
axioms (1)
  • [domain assumption] Policy gradient theorem and importance sampling hold under the masked objective used in RLVR.
    Invoked implicitly when moving from PPO surrogate to the N-step correction.
invented entities (1)
  • N-step forward trace [no independent evidence]
    purpose: Augments the surrogate objective with cumulative likelihood ratios of future tokens.
    New object introduced by the paper; no independent falsifiable evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5767 in / 1348 out tokens · 51061 ms · 2026-05-21T05:50:43.507182+00:00 · methodology


