Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training reward curves and final accuracy over GRPO. These results suggest that adaptive advantage estimation is a promising direction for critic-free reinforcement learning in language model reasoning. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.
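The idea of treating a group's rewards as noisy observations of a latent baseline can be illustrated with a scalar Kalman filter. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the noise parameters `q` and `r`, the initial state `x0` and `p0`, and the choice to normalize by the square root of the filtered variance are all hypothetical and are not taken from the abstract. A plain group-mean advantage in the style of GRPO is included for contrast.

```python
import math
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(z - mean) / (std + eps) for z in rewards]


def kalman_baseline(rewards, q=1e-5, r=1e-2, x0=0.0, p0=1.0):
    """Run a 1D Kalman filter over one group's rewards.

    Assumed (hypothetical) model: a constant latent baseline x with
    process noise variance q, observed through rewards with noise variance r.
    Returns the filtered baseline estimates and their variances, one per reward.
    """
    x, p = x0, p0
    estimates, variances = [], []
    for z in rewards:
        p = p + q              # predict: uncertainty grows by the process noise
        k = p / (p + r)        # Kalman gain: how much to trust the new reward
        x = x + k * (z - x)    # update the baseline toward the observation
        p = (1.0 - k) * p      # update: uncertainty shrinks after observing
        estimates.append(x)
        variances.append(p)
    return estimates, variances


def kalman_advantages(rewards, eps=1e-8, **filter_kwargs):
    """Advantages relative to the filtered baseline, scaled by its uncertainty.

    The scaling by sqrt(variance) is an illustrative assumption.
    """
    estimates, variances = kalman_baseline(rewards, **filter_kwargs)
    return [
        (z - m) / math.sqrt(v + eps)
        for z, m, v in zip(rewards, estimates, variances)
    ]


if __name__ == "__main__":
    group_rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
    print("group-mean advantages:", grpo_advantages(group_rewards))
    print("Kalman advantages:    ", kalman_advantages(group_rewards))
```

The filter adds no learned parameters, in line with the abstract's claim; only `q`, `r`, and the initial state need to be chosen, and the per-group cost is a single pass over the rewards.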
Forward citations
Cited by 2 Pith papers
- Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
  Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.
- K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
  A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.