
arxiv: 2505.07527 · v5 · submitted 2025-05-12 · 💻 cs.LG

Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

Pith reviewed 2026-05-22 15:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords Kalman filter · GRPO · reinforcement learning · language model reasoning · advantage estimation · baseline estimation · mathematical reasoning · policy optimization

The pith

KRPO applies a 1D Kalman filter to estimate a latent prompt-level reward baseline from group rewards, yielding improved training performance over GRPO on math reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that replacing the simple average baseline in GRPO with a Kalman-filtered estimate of a latent reward level produces more reliable advantage signals. GRPO normalizes advantages using the mean reward within each small group of rollouts for the same prompt. When groups are small or rewards are noisy, this mean fluctuates from group to group, and the resulting noise in the advantages can hurt learning. The Kalman filter treats each new group reward as an observation of an underlying baseline that evolves smoothly, and it maintains both the current estimate and its uncertainty. If this works, training reward curves stay higher and final accuracy on math problems improves at almost no extra cost.
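To make the group-mean baseline concrete, here is a minimal sketch of a group-relative advantage computation. Dividing by the group standard deviation follows the common GRPO formulation and is not spelled out in the paragraph above; the function name, `eps`, and the toy rewards are all illustrative assumptions.

```python
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: centre on the group mean, scale by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n  # population variance of the group
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Two hypothetical groups of 0/1 correctness rewards for the same prompt.
small_group = [1.0, 0.0, 0.0, 0.0]
flipped = [1.0, 1.0, 0.0, 0.0]  # a single rollout changed from 0 to 1
print(sum(small_group) / 4, sum(flipped) / 4)  # baseline moves from 0.25 to 0.5
```

With four rollouts, one flipped outcome doubles the baseline, which is the small-group sensitivity the paragraph describes.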

Core claim

KRPO treats each group's reward as a noisy observation of a single latent prompt-level reward baseline whose evolution follows a linear Gaussian model. A one-dimensional Kalman filter is then used to recursively estimate this baseline and the associated uncertainty, which replaces the simple group mean in the advantage calculation of GRPO. The resulting method integrates directly into existing GRPO pipelines with no extra learned weights and negligible extra cost. On standard mathematical reasoning benchmarks the modified training produces higher reward curves and better final accuracy than vanilla GRPO.

What carries the argument

The 1D Kalman filter that recursively updates an estimate of the latent prompt-level reward baseline from successive group reward observations.
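The recursive estimate can be sketched as a scalar random-walk Kalman filter. This is an illustration under stated assumptions, not the authors' implementation: the default values of `q`, `r`, `x0`, and `p0` are placeholders, the sketch simply iterates over a list of rewards without committing to how the paper orders its observations, and `krpo_style_advantages` is a hypothetical way of using the estimated uncertainty for scaling.

```python
import math
from typing import List, Tuple


def kalman_baseline(
    rewards: List[float],
    q: float = 1e-3,  # process noise Q (assumed value)
    r: float = 1.0,   # measurement noise R (assumed value)
    x0: float = 0.0,  # initial baseline estimate (assumed)
    p0: float = 1.0,  # initial estimate variance (assumed)
) -> Tuple[List[float], List[float]]:
    """Run a scalar random-walk Kalman filter over a sequence of rewards.

    Returns the filtered baseline and its variance after each observation.
    """
    x, p = x0, p0
    xs, ps = [], []
    for obs in rewards:
        # Predict: the baseline is a random walk, so the mean carries over
        # and only the variance grows by Q.
        p_pred = p + q
        # Update: blend the prediction with the new reward via the Kalman gain.
        k = p_pred / (p_pred + r)
        x = x + k * (obs - x)
        p = (1.0 - k) * p_pred
        xs.append(x)
        ps.append(p)
    return xs, ps


def krpo_style_advantages(rewards: List[float], eps: float = 1e-8, **kw) -> List[float]:
    """Hypothetical advantage: residual to the filtered baseline, scaled by its std."""
    xs, ps = kalman_baseline(rewards, **kw)
    return [(obs - x) / (math.sqrt(p) + eps) for obs, x, p in zip(rewards, xs, ps)]
```

The filter keeps only two scalars of state, which is consistent with the claim of no learned weights and negligible extra cost.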

Load-bearing premise

Group rewards can be viewed as noisy samples drawn from a single latent prompt-level baseline that changes according to a simple linear Gaussian process.

What would settle it

A controlled experiment comparing KRPO and GRPO on the same math reasoning benchmarks under identical training conditions would falsify the claim if KRPO showed no improvement, or a decline, in training reward or final accuracy.

Figures

Figures reproduced from arXiv: 2505.07527 by Congbo Ma, Hu Wang, Ian Reid, Mohammad Yaqub.

Figure 1: The curves of returns/rewards within a batch for different difficulty levels of questions
Figure 2: The curves of returns/rewards within a batch for different difficulty levels of questions
Figure 3: The training return curves of additional (a) base models and (b) datasets.
Figure 4: The training return curves of different (a) group sizes and (b) seeds. The base model is Llama-3.2-1B-Instruct on normal level of Arithmetic data. We next examine when KRPO is most helpful. Figure 4a shows that the performance gap between KRPO and GRPO widens when fewer rollouts are available per prompt, suggesting that the raw group mean becomes less reliable in small-group regimes. Figure 4b shows that t…
Figure 5: The training return curves of different (a) KL divergence loss weights and (b) process and …
Figure 6: The training return curves of (a) Llama3.2-3B-Instruct model and (b) Qwen2.5-7B-Instruct …
Figure 7: The curves of (a) KL divergence; (b) normalized gradients of GRPO and the proposed …
Original abstract

The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training reward curves and final accuracy over GRPO. These results suggest that adaptive advantage estimation is a promising direction for critic-free reinforcement learning in language model reasoning. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) as a lightweight extension to Group Relative Policy Optimization (GRPO). It treats per-group rewards as noisy observations of a single latent prompt-level reward baseline whose dynamics are modeled by a 1D linear-Gaussian state-space model, then applies a Kalman filter to estimate both the baseline and its uncertainty for advantage normalization. The authors report that this yields improved training reward curves and higher final accuracy on mathematical reasoning benchmarks relative to standard GRPO, with no additional learned parameters and only minimal computational overhead.

Significance. If the reported gains prove robust and attributable to the Kalman structure rather than generic smoothing, the work demonstrates a simple, parameter-free route to adaptive baseline estimation in critic-free RL for language models. The absence of new trainable parameters and the public release of code are clear strengths that support reproducibility and potential adoption.

major comments (2)
  1. [Method] The central modeling assumption—that per-group rewards can be treated as noisy observations of a latent prompt-level baseline evolving according to a linear-Gaussian process suitable for a 1D Kalman filter—is load-bearing for the claim that gains arise from the proposed estimator rather than incidental smoothing. For the sparse, discrete (often 0/1) correctness signals typical in mathematical reasoning, this distributional mismatch is not diagnosed or ablated in the manuscript.
  2. [Experiments] The experimental section provides no quantitative details on group sizes, number of independent runs, statistical significance tests, or direct comparisons against other variance-reduction baselines (e.g., exponential moving average or learned critics). Without these, it is difficult to determine whether the reported improvements over GRPO are reliable or generalizable.
minor comments (2)
  1. The abstract and results would benefit from explicit reporting of the Kalman filter process noise Q and measurement noise R values used, along with any sensitivity analysis.
  2. Notation for the state transition and observation models should be introduced with numbered equations to improve clarity for readers unfamiliar with Kalman filtering.
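The concern about incidental smoothing and the request to report Q and R are linked. For a scalar random-walk model with constant Q and R, the Kalman gain converges to a fixed value, after which the update x ← x + K(r − x) is an exponential moving average with smoothing factor K. The sketch below computes that limiting gain in closed form and checks it against the recursion; the (Q, R) pairs are illustrative assumptions, not values from the paper.

```python
import math


def steady_state_gain(q: float, r: float) -> float:
    """Limit of the Kalman gain for a scalar random-walk model with constant Q and R."""
    # The predicted variance solves P^2 = Q*P + Q*R at the fixed point.
    p_pred = (q + math.sqrt(q * q + 4.0 * q * r)) / 2.0
    return p_pred / (p_pred + r)


def iterate_gain(q: float, r: float, p0: float = 1.0, steps: int = 500) -> float:
    """Run the variance recursion and return the gain at the final step."""
    p, k = p0, 0.0
    for _ in range(steps):
        p_pred = p + q
        k = p_pred / (p_pred + r)
        p = (1.0 - k) * p_pred
    return k


# Illustrative (Q, R) pairs only.
for q, r in [(1e-3, 1.0), (1e-2, 1.0), (1e-1, 1.0), (1.0, 1.0)]:
    print(f"Q={q:g} R={r:g} gain={steady_state_gain(q, r):.4f}")
```

Only the ratio Q/R determines the limiting gain, so any difference from an exponential moving average with matched smoothing factor comes from the transient phase, where the gain still depends on the current variance.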

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method] The central modeling assumption—that per-group rewards can be treated as noisy observations of a latent prompt-level baseline evolving according to a linear-Gaussian process suitable for a 1D Kalman filter—is load-bearing for the claim that gains arise from the proposed estimator rather than incidental smoothing. For the sparse, discrete (often 0/1) correctness signals typical in mathematical reasoning, this distributional mismatch is not diagnosed or ablated in the manuscript.

    Authors: We agree that the linear-Gaussian assumption represents an approximation when applied to discrete 0/1 reward signals. The Kalman filter is used here to recursively estimate a latent continuous prompt-level baseline together with its uncertainty, which provides adaptive normalization beyond fixed smoothing. To directly address whether the observed gains derive from the Kalman structure rather than generic smoothing, we will add an ablation comparing KRPO to an exponential moving average baseline of comparable computational cost. We will also include a brief discussion of the reward distribution observed in our math reasoning tasks and how the filter behaves in practice. revision: yes

  2. Referee: [Experiments] The experimental section provides no quantitative details on group sizes, number of independent runs, statistical significance tests, or direct comparisons against other variance-reduction baselines (e.g., exponential moving average or learned critics). Without these, it is difficult to determine whether the reported improvements over GRPO are reliable or generalizable.

    Authors: We acknowledge that the current experimental reporting lacks several important details. In the revised manuscript we will explicitly state the group sizes employed, report results aggregated over multiple independent runs with different random seeds, and include statistical significance tests (e.g., paired t-tests) for the accuracy differences versus GRPO. We will further add direct comparisons against an exponential moving average baseline and, where computationally feasible, against a simple learned critic to better isolate the contribution of the Kalman filter. revision: yes
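As a rough illustration of the paired test mentioned in the second response, the sketch below computes a paired t statistic over per-seed scores using only the standard library. The function name and the numbers are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for the mean of the per-seed differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased sample variance
    if var == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n)


# Placeholder per-seed scores, NOT results from the paper.
method_a = [0.61, 0.58, 0.64, 0.60]
method_b = [0.57, 0.57, 0.60, 0.59]
t = paired_t_statistic(method_a, method_b)
print(f"t = {t:.3f} with {len(method_a) - 1} degrees of freedom")
```

Pairing by seed removes seed-to-seed variation shared by both methods. In practice a library routine such as `scipy.stats.ttest_rel` would also return the p-value.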

Circularity Check

0 steps flagged

No circularity: standard Kalman filter applied to new modeling assumption

Full rationale

The paper's core derivation applies the standard 1D Kalman filter prediction and update equations to a modeling assumption that per-group rewards are noisy observations of a latent prompt-level baseline evolving under linear-Gaussian dynamics. This assumption is introduced as a new modeling choice for advantage estimation in GRPO, not derived from or equivalent to any fitted parameter, self-defined quantity, or prior self-citation within the paper. No load-bearing step reduces by construction to the inputs; the method adds no learned parameters and the claimed improvements are presented as empirical outcomes on benchmarks rather than algebraic identities. The derivation chain remains self-contained against external benchmarks.
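For reference, the linear-Gaussian model and the scalar recursions this rationale refers to can be written out as follows. The symbols $w_i$ and $v_i$ for the process and measurement noise are notation introduced here for illustration, and the variance update in the last line is the standard Kalman form rather than something restated on this page.

```latex
\begin{align}
x_i &= x_{i-1} + w_i, & w_i &\sim \mathcal{N}(0, Q) \\
r_i &= x_i + v_i, & v_i &\sim \mathcal{N}(0, R) \\
\hat{x}_{i|i-1} &= \hat{x}_{i-1|i-1}, & P_{i|i-1} &= P_{i-1|i-1} + Q \\
K_i &= \frac{P_{i|i-1}}{P_{i|i-1} + R} \\
\hat{x}_{i|i} &= \hat{x}_{i|i-1} + K_i \,(r_i - \hat{x}_{i|i-1}), & P_{i|i} &= (1 - K_i)\, P_{i|i-1}
\end{align}
```

The first two lines are the modeling assumption; the remaining three follow mechanically from it, which is why the check finds no circular step.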

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on a domain modeling assumption about reward structure and standard Kalman filter equations; no new entities are introduced and the only free parameters are the filter's noise covariances.

free parameters (1)
  • Kalman filter process and measurement noise parameters
    These hyperparameters control how much the filter trusts the model versus the observed group rewards and must be chosen or tuned.
axioms (1)
  • domain assumption: Group rewards are noisy observations of a latent prompt-level reward baseline
    This modeling premise is required to justify applying the Kalman filter update to the baseline estimate.

pith-pipeline@v0.9.0 · 5709 in / 1175 out tokens · 37243 ms · 2026-05-22T15:46:03.153352+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight 1D Kalman-filter-based baseline estimator that adaptively tracks both the latent baseline and its uncertainty... For the prediction step: $\hat{x}_{i|i-1} = \hat{x}_{i-1|i-1}$, $P_{i|i-1} = P_{i-1|i-1} + Q$; Update: $K_i = P_{i|i-1} / (P_{i|i-1} + R)$, $\hat{x}_{i|i} = \hat{x}_{i|i-1} + K_i (r_i - \hat{x}_{i|i-1})$

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

    Relation between the paper passage and the cited Recognition theorem.

    the grouped reward observations... are not sparse. According to the Central Limit Theorem, the sum of i.i.d. sampled rewards tends to follow a Gaussian distribution. This is consistent with the assumptions of our KRPO setting.
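The Central Limit Theorem argument in the passage above can be examined exactly for 0/1 rewards. The sketch below enumerates the distribution of the mean of G i.i.d. Bernoulli rewards; the success probability 0.3 and the group sizes are illustrative assumptions, not values from the paper.

```python
import math
from typing import List, Tuple


def group_mean_distribution(p: float, g: int) -> List[Tuple[float, float]]:
    """Exact (value, probability) pairs for the mean of g i.i.d. Bernoulli(p) rewards."""
    return [
        (k / g, math.comb(g, k) * p**k * (1.0 - p) ** (g - k))
        for k in range(g + 1)
    ]


def variance(dist: List[Tuple[float, float]]) -> float:
    """Variance of a discrete distribution given as (value, probability) pairs."""
    mean = sum(v * w for v, w in dist)
    return sum(w * (v - mean) ** 2 for v, w in dist)


# Hypothetical success probability and group sizes.
for g in (4, 16, 64):
    dist = group_mean_distribution(0.3, g)
    print(g, len(dist), round(variance(dist), 6))
```

The group mean takes only G + 1 distinct values and has variance p(1 − p)/G, so the Gaussian approximation is coarse for small groups and tightens as G grows.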

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

  2. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.

  3. K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.
