Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
Pith reviewed 2026-05-22 15:46 UTC · model grok-4.3
The pith
KRPO applies a 1D Kalman filter to estimate a latent prompt-level reward baseline from group rewards, yielding improved training performance over GRPO on math reasoning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KRPO treats each group's reward as a noisy observation of a single latent prompt-level reward baseline whose evolution follows a linear Gaussian model. A one-dimensional Kalman filter is then used to recursively estimate this baseline and the associated uncertainty, which replaces the simple group mean in the advantage calculation of GRPO. The resulting method integrates directly into existing GRPO pipelines with no extra learned weights and negligible extra cost. On standard mathematical reasoning benchmarks the modified training produces higher reward curves and better final accuracy than vanilla GRPO.
What carries the argument
The 1D Kalman filter that recursively updates an estimate of the latent prompt-level reward baseline from successive group reward observations.
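The recursion can be sketched in a few lines of Python. This is a minimal illustration of a textbook scalar Kalman filter with identity transition and observation, not the authors' released code. The function names, the default values of `q`, `r`, `x0` and `p0`, and the choice to use the final filtered estimate as the baseline are all assumptions made here. How the uncertainty estimate enters the advantage normalization is not specified on this page, so the sketch only centers the rewards.

```python
def kalman_1d(rewards, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Run a scalar Kalman filter over a sequence of reward observations.

    q (process noise), r (measurement noise), x0 and p0 are illustrative
    defaults, not values taken from the paper.
    Returns the filtered means and variances after each observation.
    """
    x, p = x0, p0
    means, variances = [], []
    for obs in rewards:
        # Predict: the latent baseline carries over, uncertainty grows by q.
        p_pred = p + q
        # Update: blend the prediction with the new reward observation.
        gain = p_pred / (p_pred + r)
        x = x + gain * (obs - x)
        p = (1.0 - gain) * p_pred  # textbook posterior variance update
        means.append(x)
        variances.append(p)
    return means, variances


def centered_advantages(rewards, **filter_kwargs):
    """Subtract the final filtered estimate from each reward.

    Assumption: the baseline is the estimate after the whole group has
    been seen. The paper may define it differently.
    """
    if not rewards:
        return []
    means, _ = kalman_1d(rewards, **filter_kwargs)
    baseline = means[-1]
    return [reward - baseline for reward in rewards]
```

With `q=0.0`, `r=1.0`, `x0=0.0` and `p0=1.0`, the prior acts like one earlier observation of zero, so two rewards `[1.0, 0.0]` give a final estimate of 1/3 rather than the group mean of 1/2. That shows how the choice of prior and noise terms shifts the baseline.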
Load-bearing premise
Group rewards can be viewed as noisy samples drawn from a single latent prompt-level baseline that changes according to a simple linear Gaussian process.
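Written out, the premise corresponds to a scalar random-walk state-space model. The form below is a sketch inferred from the prediction step quoted later on this page, where the predicted state equals the previous estimate. The symbols x_i, w_i and v_i are notation assumed here rather than taken from the paper.

```latex
\begin{align}
x_i &= x_{i-1} + w_i, & w_i &\sim \mathcal{N}(0, Q) \\
r_i &= x_i + v_i,     & v_i &\sim \mathcal{N}(0, R)
\end{align}
```

Here x_i is the latent prompt-level baseline, r_i is the observed reward, Q is the process noise variance and R is the measurement noise variance.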
What would settle it
A controlled comparison of KRPO and GRPO on the same math reasoning benchmarks under identical training conditions would falsify the claim if KRPO showed no improvement, or a decline, in training reward or final accuracy.
Original abstract
The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training reward curves and final accuracy over GRPO. These results suggest that adaptive advantage estimation is a promising direction for critic-free reinforcement learning in language model reasoning. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.
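For reference, the group-mean baseline that the abstract attributes to GRPO can be sketched as follows. The function name and `eps` value are hypothetical, and the population standard deviation is an assumption, since implementations differ on whether they use the sample or population form.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Center each reward on the group mean and scale by the group std.

    eps guards against division by zero when every reward in the group
    is identical. Its value here is an illustrative choice.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(reward - mean) / (std + eps) for reward in rewards]
```

With a small group of binary rewards the baseline can only take a few values. For a group of four it is one of 0, 0.25, 0.5, 0.75 or 1, so a single rollout flipping from wrong to right moves the baseline by 0.25. This is the small-group sensitivity the abstract refers to.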
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) as a lightweight extension to Group Relative Policy Optimization (GRPO). It treats per-group rewards as noisy observations of a single latent prompt-level reward baseline whose dynamics are modeled by a 1D linear-Gaussian state-space model, then applies a Kalman filter to estimate both the baseline and its uncertainty for advantage normalization. The authors report that this yields improved training reward curves and higher final accuracy on mathematical reasoning benchmarks relative to standard GRPO, with no additional learned parameters and only minimal computational overhead.
Significance. If the reported gains prove robust and attributable to the Kalman structure rather than generic smoothing, the work demonstrates a simple, parameter-free route to adaptive baseline estimation in critic-free RL for language models. The absence of new trainable parameters and the public release of code are clear strengths that support reproducibility and potential adoption.
major comments (2)
- [Method] The central modeling assumption—that per-group rewards can be treated as noisy observations of a latent prompt-level baseline evolving according to a linear-Gaussian process suitable for a 1D Kalman filter—is load-bearing for the claim that gains arise from the proposed estimator rather than incidental smoothing. For the sparse, discrete (often 0/1) correctness signals typical in mathematical reasoning, this distributional mismatch is not diagnosed or ablated in the manuscript.
- [Experiments] The experimental section provides no quantitative details on group sizes, number of independent runs, statistical significance tests, or direct comparisons against other variance-reduction baselines (e.g., exponential moving average or learned critics). Without these, it is difficult to determine whether the reported improvements over GRPO are reliable or generalizable.
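The distinction between the Kalman structure and generic smoothing can be made concrete. For a scalar random-walk filter with fixed Q and R, a standard result is that the gain converges to a constant, at which point the update x ← x + K(r − x) is an exponential moving average with coefficient K. The sketch below computes that steady-state gain in closed form and checks it by iterating the variance recursion. The function names and the test values of `q` and `r` are hypothetical, and nothing here reflects values used in the paper.

```python
import math


def steady_state_gain(q, r):
    """Closed-form limiting gain of a scalar random-walk Kalman filter.

    The predicted variance M solves M^2 - q*M - q*r = 0, and the gain
    is M / (M + r).
    """
    m = (q + math.sqrt(q * q + 4.0 * q * r)) / 2.0
    return m / (m + r)


def iterated_gain(q, r, p0=1.0, steps=200):
    """Run the variance recursion and return the gain after `steps` updates."""
    p = p0
    gain = 0.0
    for _ in range(steps):
        p_pred = p + q
        gain = p_pred / (p_pred + r)
        p = (1.0 - gain) * p_pred
    return gain
```

Over a long sequence the two estimators therefore coincide. Any difference comes from the transient phase, where the Kalman gain is still adapting from its prior. This is why an EMA ablation at matched cost is informative for short groups.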
minor comments (2)
- The abstract and results would benefit from explicit reporting of the Kalman filter process noise Q and measurement noise R values used, along with any sensitivity analysis.
- Notation for the state transition and observation models should be introduced with numbered equations to improve clarity for readers unfamiliar with Kalman filtering.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
- Referee: [Method] The central modeling assumption, that per-group rewards can be treated as noisy observations of a latent prompt-level baseline evolving according to a linear-Gaussian process suitable for a 1D Kalman filter, is load-bearing for the claim that gains arise from the proposed estimator rather than incidental smoothing. For the sparse, discrete (often 0/1) correctness signals typical in mathematical reasoning, this distributional mismatch is not diagnosed or ablated in the manuscript.
  Authors: We agree that the linear-Gaussian assumption represents an approximation when applied to discrete 0/1 reward signals. The Kalman filter is used here to recursively estimate a latent continuous prompt-level baseline together with its uncertainty, which provides adaptive normalization beyond fixed smoothing. To directly address whether the observed gains derive from the Kalman structure rather than generic smoothing, we will add an ablation comparing KRPO to an exponential moving average baseline of comparable computational cost. We will also include a brief discussion of the reward distribution observed in our math reasoning tasks and how the filter behaves in practice.
  Revision: yes
- Referee: [Experiments] The experimental section provides no quantitative details on group sizes, number of independent runs, statistical significance tests, or direct comparisons against other variance-reduction baselines (e.g., exponential moving average or learned critics). Without these, it is difficult to determine whether the reported improvements over GRPO are reliable or generalizable.
  Authors: We acknowledge that the current experimental reporting lacks several important details. In the revised manuscript we will explicitly state the group sizes employed, report results aggregated over multiple independent runs with different random seeds, and include statistical significance tests (e.g., paired t-tests) for the accuracy differences versus GRPO. We will further add direct comparisons against an exponential moving average baseline and, where computationally feasible, against a simple learned critic to better isolate the contribution of the Kalman filter.
  Revision: yes
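The paired test mentioned in the rebuttal can be sketched as follows, assuming each seed yields one accuracy figure per method. The function name is hypothetical. It uses only the standard library and returns the t statistic with its degrees of freedom, so turning that into a p-value would still need a t-distribution table or a statistics package.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-seed scores of two methods.

    Returns (t, degrees_of_freedom). Inputs must be matched by seed.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0.0:
        raise ValueError("all paired differences are identical")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Toy numbers chosen for easy arithmetic. They are not results from the paper.
t_value, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Pairing by seed matters because both methods share the same initialization and data order within a seed, which removes seed-to-seed variance from the comparison.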
Circularity Check
No circularity: standard Kalman filter applied to new modeling assumption
The paper's core derivation applies the standard 1D Kalman filter prediction and update equations to a modeling assumption that per-group rewards are noisy observations of a latent prompt-level baseline evolving under linear-Gaussian dynamics. This assumption is introduced as a new modeling choice for advantage estimation in GRPO, not derived from or equivalent to any fitted parameter, self-defined quantity, or prior self-citation within the paper. No load-bearing step reduces by construction to the inputs; the method adds no learned parameters and the claimed improvements are presented as empirical outcomes on benchmarks rather than algebraic identities. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Kalman filter process and measurement noise parameters
axioms (1)
- domain assumption: Group rewards are noisy observations of a latent prompt-level reward baseline
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight 1D Kalman-filter-based baseline estimator that adaptively tracks both the latent baseline and its uncertainty... For the prediction step: $\hat{x}_{i|i-1} = \hat{x}_{i-1|i-1}$, $P_{i|i-1} = P_{i-1|i-1} + Q$; Update: $K_i = P_{i|i-1} / (P_{i|i-1} + R)$, $\hat{x}_{i|i} = \hat{x}_{i|i-1} + K_i (r_i - \hat{x}_{i|i-1})$"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the grouped reward observations... are not sparse. According to the Central Limit Theorem, the sum of i.i.d. sampled rewards tends to follow a Gaussian distribution. This is consistent with the assumptions of our KRPO setting."
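The Central Limit Theorem appeal in the second passage can be checked for small groups without simulation. Assuming binary rewards that are i.i.d. with success probability `p` (a hypothetical parameter, as is the group size), the group mean follows a scaled binomial distribution with mean p and variance p(1 − p)/G. The sketch below computes its exact distribution and moments.

```python
import math


def group_mean_distribution(group_size, p):
    """Exact distribution of the mean of `group_size` i.i.d. 0/1 rewards."""
    support = [k / group_size for k in range(group_size + 1)]
    pmf = [
        math.comb(group_size, k) * p**k * (1.0 - p) ** (group_size - k)
        for k in range(group_size + 1)
    ]
    return support, pmf


def mean_and_variance(support, pmf):
    """First two moments of a discrete distribution."""
    mean = sum(x * w for x, w in zip(support, pmf))
    variance = sum((x - mean) ** 2 * w for x, w in zip(support, pmf))
    return mean, variance
```

For a group of 8 the mean can take only 9 distinct values, so the Gaussian description is an approximation whose quality depends on the group size and on how far p is from 0 or 1. This is the mismatch the referee asks the authors to diagnose.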
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
  Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.
- Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
  Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.
- K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
  A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.