Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

Congbo Ma; Hu Wang; Ian Reid; Mohammad Yaqub

arxiv: 2505.07527 · v5 · submitted 2025-05-12 · 💻 cs.LG

Pith reviewed 2026-05-22 15:46 UTC · model grok-4.3

keywords: Kalman filter, GRPO, reinforcement learning, language model reasoning, advantage estimation, baseline estimation, mathematical reasoning, policy optimization

The pith

KRPO applies a 1D Kalman filter to estimate a latent prompt-level reward baseline from group rewards, yielding improved training performance over GRPO on math reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that replacing the simple average baseline in GRPO with a Kalman-filtered estimate of a latent reward level produces more reliable advantage signals. GRPO normalizes advantages using the mean reward within each small group of rollouts for the same prompt. When groups are small or rewards are noisy, this mean fluctuates from group to group, which degrades the advantage signal and can hurt learning. The Kalman filter treats each new group reward as an observation of an underlying baseline that evolves smoothly, and it maintains both the current estimate and its uncertainty. If the approach works as claimed, training reward curves stay higher and final accuracy on math problems improves at almost no added cost.
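To make the baseline issue concrete, here is a minimal sketch (not taken from the paper or its repository) of group-relative advantages as GRPO is usually described: each reward is centred on the group mean and scaled by the group standard deviation. The function names, the choice of population standard deviation, and the `eps` constant are illustrative assumptions. The second helper assumes independent 0/1 rewards with a fixed success rate `p`, which is a simplification used only to show how the variance of the group mean grows as the group shrinks.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: centre on the group mean, scale by the group std.

    Assumption: population std and a small eps guard; real implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def group_mean_variance(p, group_size):
    """Variance of the mean of `group_size` independent 0/1 rewards with success rate p."""
    return p * (1.0 - p) / group_size


if __name__ == "__main__":
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
    for g in (2, 4, 8, 16):
        # Smaller groups give a noisier mean, hence a noisier baseline.
        print(g, group_mean_variance(0.5, g))
```

Under these assumptions the variance of the baseline scales as 1/G, so halving the group size doubles it.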

Core claim

KRPO treats each group's reward as a noisy observation of a single latent prompt-level reward baseline whose evolution follows a linear Gaussian model. A one-dimensional Kalman filter is then used to recursively estimate this baseline and the associated uncertainty, which replaces the simple group mean in the advantage calculation of GRPO. The resulting method integrates directly into existing GRPO pipelines with no extra learned weights and negligible extra cost. On standard mathematical reasoning benchmarks the modified training produces higher reward curves and better final accuracy than vanilla GRPO.
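The recursion described above can be sketched as a scalar Kalman filter with a random-walk state. The predict and gain/state-update steps follow the equations quoted from the paper further down this page; the posterior-variance update `P = (1 - K) * P_pred` is the standard Kalman form and is not quoted in the source. The class name, default values for `q`, `r`, `x0`, `p0`, and in particular the way the advantage is formed (reward minus filtered estimate, divided by the square root of the posterior variance) are assumptions for illustration and may differ from the authors' implementation.

```python
import math


class ScalarKalmanBaseline:
    """1D Kalman filter tracking a latent reward baseline (illustrative sketch)."""

    def __init__(self, q=1e-3, r=1.0, x0=0.0, p0=1.0):
        self.q = q    # process noise variance Q (hypothetical default)
        self.r = r    # measurement noise variance R (hypothetical default)
        self.x = x0   # current baseline estimate
        self.p = p0   # current estimate variance

    def update(self, reward):
        # Predict: the state is carried over, its variance grows by Q.
        p_pred = self.p + self.q
        # Update: blend prediction and observation using the Kalman gain.
        k = p_pred / (p_pred + self.r)
        self.x = self.x + k * (reward - self.x)
        self.p = (1.0 - k) * p_pred  # standard posterior variance update
        return self.x, self.p


def kalman_advantages(rewards, q=1e-3, r=1.0, eps=1e-8):
    """Assumed advantage form: (reward - filtered baseline) / sqrt(posterior variance)."""
    kf = ScalarKalmanBaseline(q=q, r=r, x0=rewards[0], p0=1.0)
    estimates = [kf.update(reward) for reward in rewards]
    return [
        (reward - x) / (math.sqrt(p) + eps)
        for reward, (x, p) in zip(rewards, estimates)
    ]


if __name__ == "__main__":
    print(kalman_advantages([1.0, 0.0, 0.0, 1.0]))
```

The filter keeps only two scalars of state per tracked baseline, which is consistent with the claim of no learned weights and negligible extra cost.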

What carries the argument

The 1D Kalman filter that recursively updates an estimate of the latent prompt-level reward baseline from successive group reward observations.

Load-bearing premise

Group rewards can be viewed as noisy samples drawn from a single latent prompt-level baseline that changes according to a simple linear Gaussian process.

What would settle it

A controlled experiment comparing KRPO and GRPO on the same math reasoning benchmarks under identical training conditions would settle it: the claim is falsified if KRPO shows no improvement, or a decline, in final accuracy or training reward curves.

Figures

Figures reproduced from arXiv: 2505.07527 by Congbo Ma, Hu Wang, Ian Reid, Mohammad Yaqub.

**Figure 2.** The curves of returns/rewards within a batch for different difficulty levels of questions.

**Figure 3.** The training return curves of additional (a) base models and (b) datasets.

**Figure 4.** The training return curves of different (a) group sizes and (b) seeds. The base model is Llama-3.2-1B-Instruct on normal level of Arithmetic data.

We next examine when KRPO is most helpful. Figure 4a shows that the performance gap between KRPO and GRPO widens when fewer rollouts are available per prompt, suggesting that the raw group mean becomes less reliable in small-group regimes. Figure 4b shows that t…

**Figure 5.** The training return curves of different (a) KL divergence loss weights and (b) process and…

**Figure 6.** The training return curves of (a) Llama3.2-3B-Instruct model and (b) Qwen2.5-7B-Instruct…

**Figure 7.** The curves of (a) KL divergence; (b) normalized gradients of GRPO and the proposed…

Original abstract

The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training reward curves and final accuracy over GRPO. These results suggest that adaptive advantage estimation is a promising direction for critic-free reinforcement learning in language model reasoning. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) as a lightweight extension to Group Relative Policy Optimization (GRPO). It treats per-group rewards as noisy observations of a single latent prompt-level reward baseline whose dynamics are modeled by a 1D linear-Gaussian state-space model, then applies a Kalman filter to estimate both the baseline and its uncertainty for advantage normalization. The authors report that this yields improved training reward curves and higher final accuracy on mathematical reasoning benchmarks relative to standard GRPO, with no additional learned parameters and only minimal computational overhead.

Significance. If the reported gains prove robust and attributable to the Kalman structure rather than generic smoothing, the work demonstrates a simple, parameter-free route to adaptive baseline estimation in critic-free RL for language models. The absence of new trainable parameters and the public release of code are clear strengths that support reproducibility and potential adoption.

major comments (2)

[Method] The central modeling assumption—that per-group rewards can be treated as noisy observations of a latent prompt-level baseline evolving according to a linear-Gaussian process suitable for a 1D Kalman filter—is load-bearing for the claim that gains arise from the proposed estimator rather than incidental smoothing. For the sparse, discrete (often 0/1) correctness signals typical in mathematical reasoning, this distributional mismatch is not diagnosed or ablated in the manuscript.
[Experiments] The experimental section provides no quantitative details on group sizes, number of independent runs, statistical significance tests, or direct comparisons against other variance-reduction baselines (e.g., exponential moving average or learned critics). Without these, it is difficult to determine whether the reported improvements over GRPO are reliable or generalizable.
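The "incidental smoothing" concern can be made concrete with a short sketch. For a scalar Kalman filter with fixed Q and R, the gain sequence does not depend on the observed rewards, so the filtered estimate is an exponential moving average with a deterministic, time-varying step size. The code below is an illustration of that standard property, not the paper's ablation; the function names and the default `p0` are hypothetical, and the comparison only holds if Q and R are held constant during training.

```python
def ema_baseline(rewards, alpha, x0=0.0):
    """Exponential moving average with a fixed step size alpha."""
    x = x0
    estimates = []
    for reward in rewards:
        x = x + alpha * (reward - x)
        estimates.append(x)
    return estimates


def kalman_gains(steps, q, r, p0=1.0):
    """Gain schedule of a scalar random-walk Kalman filter with fixed Q and R.

    The gains depend only on q, r and p0, never on the rewards themselves.
    """
    p = p0
    gains = []
    for _ in range(steps):
        p_pred = p + q
        k = p_pred / (p_pred + r)
        p = (1.0 - k) * p_pred
        gains.append(k)
    return gains


if __name__ == "__main__":
    # With q = 0 the gains decay like 1/(n+1), i.e. a running mean with a prior.
    print(kalman_gains(5, q=0.0, r=1.0))
    # With q > 0 the gains settle to a constant, i.e. a plain EMA in the long run.
    print(kalman_gains(50, q=0.01, r=1.0)[-1])
```

An EMA ablation with `alpha` set to the settled gain would therefore differ from the filter mainly in the early steps, which is one way to separate the two effects.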

minor comments (2)

The abstract and results would benefit from explicit reporting of the Kalman filter process noise Q and measurement noise R values used, along with any sensitivity analysis.
Notation for the state transition and observation models should be introduced with numbered equations to improve clarity for readers unfamiliar with Kalman filtering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses

Referee: [Method] The central modeling assumption—that per-group rewards can be treated as noisy observations of a latent prompt-level baseline evolving according to a linear-Gaussian process suitable for a 1D Kalman filter—is load-bearing for the claim that gains arise from the proposed estimator rather than incidental smoothing. For the sparse, discrete (often 0/1) correctness signals typical in mathematical reasoning, this distributional mismatch is not diagnosed or ablated in the manuscript.

Authors: We agree that the linear-Gaussian assumption is an approximation when applied to discrete 0/1 reward signals. The Kalman filter is used here to recursively estimate a latent continuous prompt-level baseline together with its uncertainty, which provides adaptive normalization beyond fixed smoothing. To directly address whether the observed gains derive from the Kalman structure rather than generic smoothing, we will add an ablation comparing KRPO to an exponential moving average baseline of comparable computational cost. We will also include a brief discussion of the reward distribution observed in our math reasoning tasks and how the filter behaves in practice.

Revision: yes

Referee: [Experiments] The experimental section provides no quantitative details on group sizes, number of independent runs, statistical significance tests, or direct comparisons against other variance-reduction baselines (e.g., exponential moving average or learned critics). Without these, it is difficult to determine whether the reported improvements over GRPO are reliable or generalizable.

Authors: We acknowledge that the current experimental reporting lacks several important details. In the revised manuscript we will explicitly state the group sizes employed, report results aggregated over multiple independent runs with different random seeds, and include statistical significance tests (e.g., paired t-tests) for the accuracy differences versus GRPO. We will further add direct comparisons against an exponential moving average baseline and, where computationally feasible, against a simple learned critic to better isolate the contribution of the Kalman filter.

Revision: yes
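As a hedged illustration of the paired t-test mentioned in the response, the sketch below computes the test statistic and degrees of freedom for per-seed scores of two methods using only the standard library. The function name is hypothetical, no values from the paper are used, and converting the statistic to a p-value would require a t-distribution table or a statistics package.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. one score per seed."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1
```

Pairing by seed removes seed-level variation shared by both methods, which is why it is preferred over an unpaired test when both methods are run on the same seeds.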

Circularity Check

0 steps flagged

No circularity: standard Kalman filter applied to new modeling assumption

Full rationale

The paper's core derivation applies the standard 1D Kalman filter prediction and update equations to a modeling assumption that per-group rewards are noisy observations of a latent prompt-level baseline evolving under linear-Gaussian dynamics. This assumption is introduced as a new modeling choice for advantage estimation in GRPO, not derived from or equivalent to any fitted parameter, self-defined quantity, or prior self-citation within the paper. No load-bearing step reduces by construction to the inputs; the method adds no learned parameters and the claimed improvements are presented as empirical outcomes on benchmarks rather than algebraic identities. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on a domain modeling assumption about reward structure and standard Kalman filter equations; no new entities are introduced and the only free parameters are the filter's noise covariances.

free parameters (1)

Kalman filter process and measurement noise parameters
These hyperparameters control how much the filter trusts the model versus the observed group rewards and must be chosen or tuned.
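How Q and R trade off can be seen from the steady-state gain of a scalar random-walk Kalman filter. Solving the fixed point of the variance recursion gives a predicted variance of (Q + sqrt(Q^2 + 4QR)) / 2, and the gain follows from it. The sketch below is a standard derivation rather than something stated in the paper, and the function name is hypothetical.

```python
import math


def steady_state_gain(q, r):
    """Limiting Kalman gain for a scalar random-walk state.

    q: process noise variance Q (>= 0), r: measurement noise variance R (> 0).
    """
    # Fixed point of p_pred = p_pred * r / (p_pred + r) + q.
    p_pred = 0.5 * (q + math.sqrt(q * q + 4.0 * q * r))
    return p_pred / (p_pred + r)


if __name__ == "__main__":
    for q in (1e-4, 1e-2, 1.0):
        # Larger Q relative to R means the filter trusts new rewards more.
        print(q, steady_state_gain(q, 1.0))
```

A gain near 0 makes the baseline nearly static, while a gain near 1 makes it track each new reward, so a sensitivity analysis over the ratio Q/R would cover the regimes that matter.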

axioms (1)

Domain assumption: group rewards are noisy observations of a latent prompt-level reward baseline
This modeling premise is required to justify applying the Kalman filter update to the baseline estimate.

pith-pipeline@v0.9.0 · 5709 in / 1175 out tokens · 37243 ms · 2026-05-22T15:46:03.153352+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight 1D Kalman-filter-based baseline estimator that adaptively tracks both the latent baseline and its uncertainty... For the prediction step: $\hat{x}_{i|i-1} = \hat{x}_{i-1|i-1}$, $P_{i|i-1} = P_{i-1|i-1} + Q$; Update: $K_i = P_{i|i-1} / (P_{i|i-1} + R)$, $\hat{x}_{i|i} = \hat{x}_{i|i-1} + K_i (r_i - \hat{x}_{i|i-1})$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

the grouped reward observations... are not sparse. According to the Central Limit Theorem, the sum of i.i.d. sampled rewards tends to follow a Gaussian distribution. This is consistent with the assumptions of our KRPO setting.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0
Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0
Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.

K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
cs.LG · 2026-04 · unverdicted · novelty 6.0
A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 2 Pith papers · 14 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[2] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.

[3] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.

[4] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.

[5] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.

[6] Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.

[7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[8] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

[9] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. 2023.

[10] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.

[11] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[12] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.

[13] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.

[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[15] Open-Thought. Tiny-grpo math tasks dataset. https://github.com/open-thought/tiny-grpo/blob/main/data/math_tasks.jsonl, 2024. Accessed: 2025-05-04.

[16] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[17] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

[18] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.

[19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[20] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[21] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.

[22] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[23] Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman. OpenMathInstruct-1: A 1.8 million math instruction tuning dataset. arXiv preprint arXiv:2402.10176, 2024.

[24] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[25] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

[26] Hemish Veeraboina. AIME problem set: 1983–2024. Kaggle dataset, 2024.

[27] Hu Wang, Hao Chen, Qi Wu, Congbo Ma, and Yidong Li. Multi-intersection traffic optimisation: A benchmark dataset and a strong baseline. IEEE Open Journal of Intelligent Transportation Systems, 3:126–136, 2021.

[28] Hu Wang, Qi Wu, and Chunhua Shen. Soft expert reward learning for vision-and-language navigation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 126–141. Springer, 2020.

[29] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003. PMLR, 2016.

[30] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[31] Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. RRHF: Rank responses to align language models with human feedback. Advances in Neural Information Processing Systems, 36:10935–10950, 2023.
A Datasets. Arithmetic Dataset [16]. It contains 100,000 arithmetic problems involving addition, subtraction, multiplication, and divisio...

[32] In contrast, the proposed KRPO can get the correct answer 1. For this question, the KRPO ...
Table 5: Case study for the thinking process of GRPO and the proposed KRPO model. Question: {"type": "Algebra", "question": "If $7^{4x} = 343$, what is the value of $7^{4x-3}$", "expected_answer": "1"}. Model thinking process, GRPO ✗: First, let's rewrite $7^{4x} = 343$ as $7^{4x} = 7 \ldots$

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

work page 2017

[4] [4]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Addressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

work page 2018

[6] [6]

Reinforced Self-Training (ReST) for Language Modeling

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling.arXiv preprint arXiv:2308.08998, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications.arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

Rlaif: Scaling reinforcement learning from human feedback with ai feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. 2023

work page 2023

[10] [10]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[12] [12]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Asynchronous methods for deep reinforce- ment learning

V olodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce- ment learning. InInternational conference on machine learning, pages 1928–1937. PmLR, 2016

work page 1928

[14] [14]

Playing Atari with Deep Reinforcement Learning

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[15] [15]

Tiny-grpo math tasks dataset

Open-Thought. Tiny-grpo math tasks dataset. https://github.com/open-thought/ tiny-grpo/blob/main/data/math_tasks.jsonl, 2024. Accessed: 2025-05-04. 10

work page 2024

[16] [16]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022

[17] [17]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

work page 2023

[18] [18]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

work page 2015

[19] [19]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

[21]

Policy gradient methods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999

[22]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012

[23]

OpenMathInstruct-1: A 1.8 million math instruction tuning dataset

Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman. Openmathinstruct-1: A 1.8 million math instruction tuning dataset. arXiv preprint arXiv:2402.10176, 2024

[24]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[25]

Deep reinforcement learning with double q-learning

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016

[26]

Aime problem set: 1983–2024

Hemish Veeraboina. Aime problem set: 1983–2024. Kaggle dataset, 2024

[27]

Multi-intersection traffic optimisation: A benchmark dataset and a strong baseline

Hu Wang, Hao Chen, Qi Wu, Congbo Ma, and Yidong Li. Multi-intersection traffic optimisation: A benchmark dataset and a strong baseline. IEEE Open Journal of Intelligent Transportation Systems, 3:126–136, 2021

[28]

Soft expert reward learning for vision-and-language navigation

Hu Wang, Qi Wu, and Chunhua Shen. Soft expert reward learning for vision-and-language navigation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 126–141. Springer, 2020

[29]

Dueling network architectures for deep reinforcement learning

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003. PMLR, 2016

[30]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[31]

RRHF: Rank responses to align language models with human feedback

Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback. Advances in Neural Information Processing Systems, 36:10935–10950, 2023.

A Datasets

Arithmetic Dataset [16]. It contains 100,000 arithmetic problems involving addition, subtraction, multiplication, and divisio...


In contrast, the proposed KRPO can get the correct answer 1. For this question, the KRPO

Table 5: Case study for the thinking process of GRPO and the proposed KRPO model.

Question: {"type": "Algebra", "question": "If $7^{4x} = 343$, what is the value of $7^{4x-3}$", "expected_answer": "1"}

Model: GRPO ✗. Thinking Process: First, let's rewrite $7^{4x} = 343$ as 7^{4x} = 7 ...
