Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

Byeongchan Kim; Deokgyu Yoon; Gyungin Shin; Hyungkyu Kang; Joongkyu Lee; Min-hwan Oh; Sungrae Park

arxiv: 2605.20865 · v1 · pith:W6PEMK22new · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh

Pith reviewed 2026-05-21 05:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: reinforcement learning, policy optimization, PPO, large language models, verifiable rewards, likelihood ratio, policy gradient, reasoning

The pith

The N-step forward trace augments PPO to achieve tighter policy-improvement bounds in reinforcement learning with verifiable rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that augmenting the standard PPO surrogate with the cumulative likelihood ratio over multiple future tokens produces a more accurate objective for policy improvement. This is achieved through the N-step forward trace integrated into NFPO, which operates within the masked policy gradient setup for language models. If correct, this approach reduces the bias from local approximations while controlling variance, leading to better training of reasoning capabilities in LLMs using verifiable rewards. Theory supports a tighter bound for appropriate N, and experiments back this with gains on benchmarks.

Core claim

We introduce the N-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next N-1 tokens. We propose NFPO, a practical RLVR algorithm that integrates the N-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective. With an appropriate choice of N, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance.
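The "continuous bridge" can be made concrete with a telescoping identity. What follows is a sketch under assumptions not stated in the claim above: an outcome reward $R(y)$ on a length-$T$ response sampled from a behavior policy $\mu$, per-token ratios $\rho_t = \pi(y_t \mid s_t)/\mu(y_t \mid s_t)$, and a forward trace written as a truncated product. The notation is illustrative and may differ from the paper's, which also involves clipping and token masks that are omitted here.

```latex
% Assumed definition of the N-step forward trace (empty product = 1):
\Gamma^{(N)}_{t+1} \;=\; \prod_{k=t+1}^{\min(t+N-1,\,T)} \rho_k ,
\qquad \Gamma^{(1)}_{t+1} = 1 .

% Telescoping identity for the full product of ratios (N = T):
\prod_{t=1}^{T} \rho_t \;-\; 1
\;=\; \sum_{t=1}^{T} \Bigl( \prod_{k=t}^{T} \rho_k - \prod_{k=t+1}^{T} \rho_k \Bigr)
\;=\; \sum_{t=1}^{T} (\rho_t - 1)\, \Gamma^{(T)}_{t+1} .

% Hence the exact improvement and the assumed N-step surrogate:
J(\pi) - J(\mu)
\;=\; \mathbb{E}_{y \sim \mu}\Bigl[ R(y) \sum_{t=1}^{T} (\rho_t - 1)\, \Gamma^{(T)}_{t+1} \Bigr],
\qquad
L^{(N)}_{\mu}(\pi)
\;=\; \mathbb{E}_{y \sim \mu}\Bigl[ R(y) \sum_{t=1}^{T} (\rho_t - 1)\, \Gamma^{(N)}_{t+1} \Bigr].
```

Under these assumptions, $N = 1$ drops every future ratio and leaves a purely local, PPO-style sum of $(\rho_t - 1)$ terms, while $N = T$ recovers the exact importance-sampled improvement; intermediate $N$ truncates the product after $N-1$ future tokens.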

What carries the argument

The N-step forward trace that augments the PPO surrogate objective by the cumulative likelihood ratio of the next N-1 tokens.
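A minimal sketch of how such a trace could be computed for a single rollout, assuming the trace at position t is the product of the per-token likelihood ratios of the next N-1 tokens, truncated at the end of the sequence. The function names `forward_trace` and `trace_corrected_ratios` are hypothetical and are not taken from the paper or any library; clipping, masking, and batching are left out.

```python
import math

def forward_trace(log_ratios, n_steps):
    """For each position t (0-indexed), return the product of the ratios of
    the next n_steps - 1 tokens, truncated at the end of the sequence.
    Assumed definition for illustration; an empty product equals 1."""
    if n_steps < 1:
        raise ValueError("n_steps must be >= 1")
    T = len(log_ratios)
    # prefix[j] is the sum of log_ratios[0:j]; working in log space keeps
    # long products numerically stable.
    prefix = [0.0]
    for lr in log_ratios:
        prefix.append(prefix[-1] + lr)
    trace = []
    for t in range(T):
        start = t + 1               # first future token
        stop = min(t + n_steps, T)  # one past the last token in the window
        trace.append(math.exp(prefix[stop] - prefix[start]))
    return trace

def trace_corrected_ratios(log_ratios, n_steps):
    """Per-token ratio multiplied by its forward trace."""
    gammas = forward_trace(log_ratios, n_steps)
    return [math.exp(lr) * g for lr, g in zip(log_ratios, gammas)]

# Toy rollout with ratios 2, 0.5, 4 (hypothetical numbers).
log_ratios = [math.log(2.0), math.log(0.5), math.log(4.0)]
print(forward_trace(log_ratios, 1))           # all ones: the local one-step case
print(forward_trace(log_ratios, 2))           # each entry is the next token's ratio
print(trace_corrected_ratios(log_ratios, 3))  # full-horizon products for T = 3
```

With `n_steps=1` the trace is identically 1, so the corrected ratio reduces to the ordinary token-level ratio; increasing `n_steps` folds more future ratios into each weight.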

Load-bearing premise

That the cumulative likelihood ratio over the next N-1 tokens can be stably incorporated into the existing masked policy gradient framework without introducing new sources of instability or requiring additional variance-reduction techniques.

What would settle it

The claim would be undermined if the theoretical analysis fails to show a tighter bound for any N, or if empirical results on reasoning benchmarks show no improvement over standard PPO.

Figures

Figures reproduced from arXiv: 2605.20865 by Byeongchan Kim, Deokgyu Yoon, Gyungin Shin, Hyungkyu Kang, Joongkyu Lee, Min-hwan Oh, Sungrae Park.

**Figure 1.** NFPO corrects the bias of local PPO/GRPO-style objectives by reweighting each token likelihood ratio $\rho_t := \pi(y_t \mid s_t) / \mu(y_t \mid s_t)$ with a forward trace $\Gamma^{(N)}_{t+1}$. (Right) The trace horizon $N$ controls the bias–variance trade-off, interpolating between PPO/GRPO ($N = 1$) and the exact policy gradient objective ($N = T$).

**Figure 2.** Bias–variance trade-off of the $N$-step surrogate objective on the token MDP example. However, at the same time, the variance $\mathbb{V}[\hat{L}^{(N)}_{\mu}(\pi)]$ increases with $N$, reflecting the growing instability introduced by longer products of importance ratios. Overall, this example provides a clean and controlled validation of the theory: small $N$ yields a stable but biased objective, while large $N$ reduces bias at the cos…

**Figure 3.** (Left) Comparison of token-level ($\rho_t$) and trace-corrected ($\rho_t \cdot \bar{\Gamma}^{(N,i)}_{t+1}$) likelihood ratios along a 50-token rollout segment. (Right) Batch-averaged switch frequency, halved by the forward trace. $\mathbb{E}_t\big[\,|\rho_t \cdot \bar{\Gamma}^{(N,i)}_{t+1} - 1|\,\big]$ quantifies the correction strength of NFPO. We compute correction strength for approximately 3.4M tokens collected during training, and categorize the tokens semantically to compare…

**Figure 4.** Ablation results across $N$, where $N = 1$ corresponds to DPPO. (Left) Variance of forward trace. (Middle) Switch frequency of forward trace. (Right) Performance on the benchmarks.

**Figure 5.** Ablation experiments on NFPO: (Left) Different clip ranges, (Middle) Alternative design choices, and (Right) Different token masks. … trace and the strength of the low-pass filtering are optimally balanced. This interplay naturally leads to the emergence of a performance sweet spot. 6 Ablation Study. We present a comprehensive ablation study on the hyperparameters and design choices of NFPO. Ablation 1. Forwa…

Original abstract

Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.
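The variance side of this trade-off can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that per-token log-ratios are i.i.d. Gaussian with mean $-\sigma^2/2$ and standard deviation $\sigma$, so each ratio has mean 1; real ratios along a rollout are neither independent nor log-normal. Under that assumption a product of $n$ ratios has variance $e^{n\sigma^2} - 1$, which grows with the number of factors. The function names and the value of `sigma` are hypothetical.

```python
import math
import random

def analytic_trace_variance(sigma, n_factors):
    """Variance of a product of n_factors i.i.d. log-normal ratios, each with
    mean 1 (log-mean -sigma^2/2, log-std sigma). Toy model, not the paper's."""
    return math.exp(n_factors * sigma ** 2) - 1.0

def simulated_trace_variance(sigma, n_factors, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same quantity with a fixed seed."""
    rng = random.Random(seed)
    mu = -0.5 * sigma ** 2
    samples = []
    for _ in range(n_samples):
        # Sum log-ratios, then exponentiate to get the product of ratios.
        log_prod = sum(rng.gauss(mu, sigma) for _ in range(n_factors))
        samples.append(math.exp(log_prod))
    mean = sum(samples) / n_samples
    return sum((x - mean) ** 2 for x in samples) / (n_samples - 1)

# An N-step trace multiplies N - 1 future ratios.
for n_steps in (1, 2, 4, 8):
    n_factors = n_steps - 1
    print(n_steps,
          analytic_trace_variance(0.2, n_factors),
          simulated_trace_variance(0.2, n_factors))
```

In this toy model the trace has zero variance at N = 1 and its variance increases monotonically with N, which matches the qualitative picture of longer products of importance ratios being less stable.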

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NFPO adds an N-step likelihood product to the PPO surrogate for RLVR in LLMs and claims a tighter improvement bound, but the masking details in the derivation are the part that needs checking.

The paper's core move is to augment the one-step importance ratio in the PPO surrogate with the cumulative likelihood ratio of the next N-1 tokens. This creates a family of objectives that sits between the standard local PPO loss and the full importance-sampled gradient. They call the construction the N-step forward trace and build NFPO around it inside the usual masked token-level update for language models. The abstract says this yields a tighter policy-improvement bound for suitable N and consistent gains on reasoning benchmarks. That is the actual new piece: a continuous interpolation knob rather than another variance-reduction trick or a new trust-region constraint.

The experiments appear to be run on standard reasoning suites, which is the right place to test it. The integration claim into the masked framework is also stated plainly, which is better than many papers that just wave at the setting.

The soft spot is exactly the one the stress-test flags. When you form the cumulative product, some future tokens are masked out in the gradient computation. If the bound proof treats the product as if every position contributes, the claimed reduction in approximation error does not automatically follow. The abstract gives no equation showing how the mask indicator travels through the expectation, so it is impossible to tell whether the tighter bound holds or whether it is an artifact of ignoring the mask. N is described only as "appropriate," which leaves open the possibility that it is tuned on the same data used for the final numbers. That is a minor but real circularity risk until an ablation or fixed-N protocol is shown.

The paper is aimed at people who already run PPO-style RLVR on LLMs and want a drop-in surrogate with one extra hyperparameter. A reader who cares about bias-variance control in token-level policy gradients will get something concrete to try. It is worth sending to referees because the algorithmic idea is well-specified and the target benchmarks are central; the theory section will just need the mask propagation written out explicitly before acceptance.

Referee Report

1 major / 2 minor

Summary. The paper introduces N-Step Forward-Trace Policy Optimization (NFPO) for RL with verifiable rewards in LLMs. It augments the PPO surrogate with an N-step forward trace (cumulative likelihood-ratio product over the next N-1 tokens) and integrates it into the masked policy-gradient framework for token-level updates. The central theoretical claim is that an appropriate N yields a strictly tighter policy-improvement bound than the standard one-step PPO surrogate; experiments on reasoning benchmarks are reported to support the claim.

Significance. If the bound derivation holds after proper mask handling, the method supplies a controllable bias-variance bridge between the PPO surrogate and the exact policy gradient, which is a useful conceptual advance for RLVR. The explicit integration into masked token-level gradients and the reported benchmark gains are concrete strengths.

major comments (1)

[§4] §4 (Theorem 1 and surrounding derivation): the proof that the N-step objective produces a strictly tighter improvement bound treats the cumulative likelihood-ratio product as if formed over a fully observed trajectory. In the masked policy-gradient setting used for token-level LLM updates, only a subset of positions contribute; the manuscript supplies no equation showing that the mask indicator is propagated inside the expectation, so the claimed reduction in approximation error does not necessarily follow.

minor comments (2)

[Abstract] The abstract and §5 refer to 'comprehensive reasoning benchmarks' without listing the exact datasets or splits used; this should be stated explicitly.
[§5] Table 2 and Figure 3 lack error bars or standard deviations across seeds; adding them would strengthen the empirical support for the theoretical prediction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The primary concern raised pertains to the explicit handling of masks within the theoretical derivation of Theorem 1. We address this point below and will revise the manuscript to incorporate the requested clarification.

Referee: [§4] §4 (Theorem 1 and surrounding derivation): the proof that the N-step objective produces a strictly tighter improvement bound treats the cumulative likelihood-ratio product as if formed over a fully observed trajectory. In the masked policy-gradient setting used for token-level LLM updates, only a subset of positions contribute; the manuscript supplies no equation showing that the mask indicator is propagated inside the expectation, so the claimed reduction in approximation error does not necessarily follow.

Authors: We agree that the current write-up of the proof would be strengthened by an explicit equation showing propagation of the mask indicator inside the expectation. In the revised version we will insert the masked N-step likelihood-ratio product, defined as the product of per-token ratios multiplied by the indicator that the position contributes to the gradient. We will then re-derive the policy-improvement bound under this masked expectation and show that the N-step surrogate remains strictly tighter than the one-step PPO surrogate for appropriate N. This addition directly addresses the concern while preserving the original claim.

Revision: yes
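One possible reading of the masked product described in the response can be sketched as follows. This is an assumption, not the authors' definition: the weight at position t is taken to be the mask indicator at t times the ratio at t times the ratios of the next N-1 tokens. Whether masked-out future positions should also be dropped from the product is exactly the detail the referee asks to have written out; this sketch keeps them in. The name `masked_trace_weights` is hypothetical.

```python
import math

def masked_trace_weights(log_ratios, mask, n_steps):
    """Hypothetical masked N-step weight for each position t:
    mask[t] * rho_t * (product of the next n_steps - 1 ratios),
    truncated at the end of the sequence."""
    if len(log_ratios) != len(mask):
        raise ValueError("log_ratios and mask must have the same length")
    if n_steps < 1:
        raise ValueError("n_steps must be >= 1")
    T = len(log_ratios)
    weights = []
    for t in range(T):
        stop = min(t + n_steps, T)
        # Sum of log-ratios over positions t .. stop-1 (current token plus
        # the next n_steps - 1 tokens).
        log_w = sum(log_ratios[t:stop])
        # The indicator zeroes out positions that do not contribute to the gradient.
        weights.append(mask[t] * math.exp(log_w))
    return weights

# Toy rollout: ratios 2, 0.5, 4 with the middle token masked out.
log_ratios = [math.log(2.0), math.log(0.5), math.log(4.0)]
mask = [1, 0, 1]
print(masked_trace_weights(log_ratios, mask, 1))  # masked one-step ratios
print(masked_trace_weights(log_ratios, mask, 2))  # masked two-step products
```

Note that in the two-step case the first weight still includes the ratio of the masked middle token, which shows why the convention for masked future positions matters for any bound derived from these weights.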

Circularity Check

0 steps flagged

No significant circularity detected in the bound derivation

The paper derives a tighter policy-improvement bound for the N-step forward-trace objective relative to the standard PPO surrogate via theoretical analysis. This is a conventional bias-variance argument in which larger N reduces the structural bias of the local surrogate toward the exact policy gradient; the bound is not shown to be tautological or obtained by fitting a parameter that is then relabeled as a prediction. No self-definitional steps, fitted inputs presented as predictions, or load-bearing self-citations appear in the provided material. The derivation remains independent of its own outputs and is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The work rests on standard policy-gradient and importance-sampling assumptions plus the new N-step forward trace construct; N itself functions as a tunable parameter whose selection affects the bound tightness.

free parameters (1)

N
Step count chosen to achieve the tighter bound; its value is not derived from first principles in the abstract.

axioms (1)

Domain assumption: the policy gradient theorem and importance sampling hold under the masked objective used in RLVR.
Invoked implicitly when moving from PPO surrogate to the N-step correction.

invented entities (1)

N-step forward trace (no independent evidence)
purpose: Augments the surrogate objective with cumulative likelihood ratios of future tokens.
New object introduced by the paper; no independent falsifiable evidence supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We introduce the N-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next N−1 tokens... Theorem 1 shows that increasing N reduces the truncation bias at the rate O((T−N)²δ²)"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "NFPO integrates the N-step forward trace into the masked policy gradient framework"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models.arXiv preprint arXiv:2508.10751, 2025

work page arXiv 2025

[3] [3]

Reasoning with exploration: An entropy perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30377–30385, 2026

work page 2026

[4] Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. GPG: A simple and strong reinforcement learning baseline for model reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=inccdtfx8x.

[5] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.

[6] Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347, 2025.

[7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[8] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

[9] Shuo He, Lang Feng, Xin Cheng, Lei Feng, and Bo An. Online causal kalman filtering for stable and effective policy optimization. arXiv preprint arXiv:2602.10609, 2026.

[10] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.

[11] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the nineteenth international conference on machine learning, pages 267–274, 2002.

[12] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022.

[13] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.

[14] ZhiHang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the training of group relative policy optimization-based reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=SVHerutWxp.

[15] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.

[16] Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, and Jingren Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization. arXiv preprint arXiv:2603.19835, 2026.

[17] MAA. American mathematics competitions (AMC). https://maa.org/, 2023.

[18] MAA. American invitational mathematics examination (AIME). https://maa.org/, 2026.

[19] OpenAI. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/, 2024. Blog.

[20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

[21] Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, and Philipp Krähenbühl. Entropy-preserving reinforcement learning. arXiv preprint arXiv:2603.11682, 2026.

[22] Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via fp16. arXiv preprint arXiv:2510.26788, 2025.

[23] Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, and Wee Sun Lee. Rethinking the trust region in llm reinforcement learning. arXiv preprint arXiv:2602.04879, 2026.

[24] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.

[25] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015.

[26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[28] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.

[29] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

[30] Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning. arXiv preprint arXiv:2602.02710, 2026.

[31] Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, et al. Every step evolves: Scaling reinforcement learning for trillion-scale thinking model. arXiv preprint arXiv:2510.18855, 2025.

[32] Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai. Aspo: Asymmetric importance sampling policy optimization. arXiv preprint arXiv:2510.06062, 2025.

[33] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. Advances in Neural Information Processing Systems, 39, 2025.

[34] Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, and Xiangnan He. Quantile advantage estimation for entropy-safe reasoning. arXiv preprint arXiv:2509.22611, 2025.

[35] Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343, 2025.

[36] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[37] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[38] Xiaojiang Zhang, Jinghui Wang, Zifei Cheng, Wenhao Zhuang, Zheng Lin, Minglei Zhang, Shaojie Wang, Yinghan Cui, Chao Wang, Junyi Peng, et al. Srpo: A cross-domain implementation of large-scale reinforcement learning on llm. arXiv preprint arXiv:2504.14286, 2025.

[39] Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025.

[40] Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, et al. Stabilizing reinforcement learning with llms: Formulation and practices. arXiv preprint arXiv:2512.01374, 2025.

[41] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.

$$\mathbb{E}_{y\sim\mu}\left[R(y)\sum_{t=1}^{T}(\rho_t-1)\left(\Gamma_{t+1}-\Gamma^{(N)}_{t+1}\right)\right]=\mathbb{E}_{y\sim\mu}\left[R(y)\sum_{t=1}^{T}(\rho_t-1)\,\Gamma^{(N)}_{t+1}\left(Q_{m_{t+1}}-1\right)\right]\le\xi\sum_{t=1}^{T}\mathbb{E}_{\mu}$$

[42] Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=ftVlLG9cks.

A Related Works

Reinforcement learning with verifiable rewards (RLVR). R...

0.56 on the token However and stays low across the whole self-correction phrase “However, this does not fit the context... Let’s re-evaluate

0.999 1.08 1.083 −0.500
884 ·Let 1.000 1.05 1.052 −0.500
885 ’s 1.000 1.05 1.052 −0.500
886 ·re 0.986 1.05 1.052 −0.500
887 -e 1.063 0.94 0.942 −0.500
888 valuate 1.000 0.94 0.942 −0.500

This case is the symmetric counterpart of Case 2. The trajectory ultimately fails (reward 0) and the global advantage is therefore negative ($\hat{A}_t = -0.500$). A purely local surrogate ...