Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

Hu Wang, Congbo Ma, Ian Reid, Mohammad Yaqub

classification cs.LG

keywords advantage, grpo, krpo, baseline, filter, group, kalman, language

The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training reward curves and final accuracy over GRPO. These results suggest that adaptive advantage estimation is a promising direction for critic-free reinforcement learning in language model reasoning. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.
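
To make the contrast in the abstract concrete, the following is a minimal sketch of the two baselines: the GRPO-style within-group mean and a 1D Kalman filter that treats each rollout reward as a noisy observation of a latent baseline. It is an illustration under stated assumptions, not the authors' implementation. The function names, the noise variances `process_var` and `measurement_var`, the initial state, and the choice to normalize by the square root of the filtered variance are all hypothetical; the linked repository is the reference for the actual method.

```python
import math
from typing import List, Tuple


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: subtract the group mean, divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def kalman_baseline(
    rewards: List[float],
    process_var: float = 1e-5,      # assumed drift of the latent baseline
    measurement_var: float = 1e-2,  # assumed noise of each observed reward
    init_mean: float = 0.0,         # assumed prior mean
    init_var: float = 1.0,          # assumed prior variance
) -> Tuple[List[float], List[float]]:
    """Run a scalar Kalman filter over the rewards of one group.

    Returns the filtered baseline estimate and its variance after each reward.
    """
    x, p = init_mean, init_var
    means, variances = [], []
    for r in rewards:
        p = p + process_var              # predict: uncertainty grows slightly
        k = p / (p + measurement_var)    # Kalman gain
        x = x + k * (r - x)              # update the estimate with the innovation
        p = (1.0 - k) * p                # update the uncertainty
        means.append(x)
        variances.append(p)
    return means, variances


def kalman_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantages relative to the filtered baseline (hypothetical normalization)."""
    means, variances = kalman_baseline(rewards)
    return [
        (r - m) / (math.sqrt(v) + eps)
        for r, m, v in zip(rewards, means, variances)
    ]


if __name__ == "__main__":
    group_rewards = [1.0, 0.0, 1.0, 1.0]  # made-up rewards for one prompt
    print("GRPO:  ", grpo_advantages(group_rewards))
    print("Kalman:", kalman_advantages(group_rewards))
```

The filter adds no learned parameters: its state is two scalars per group, which is consistent with the abstract's claim of minimal computational overhead. How the noise variances are set, and whether the filter state is shared across groups, are design choices this sketch does not settle.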

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.