K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning
Pith reviewed 2026-05-08 11:59 UTC · model grok-4.3
The pith
A 1D Kalman filter replaces reward normalization in policy gradient reinforcement learning by estimating the latent reward mean online from noisy returns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a 1D Kalman filter can be integrated into the reward pipeline to recursively estimate the latent reward mean from noisy returns, yielding an adaptive and variance-reduced signal that is fed directly into policy gradient updates without any changes to the policy network architecture.
What carries the argument
A 1D Kalman filter applied to the stream of observed returns, which maintains and updates estimates of the reward mean and its uncertainty through recursive prediction and correction steps.
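A minimal sketch of such a scalar filter, assuming the standard random-walk-plus-noise model; the class name, default noise values, and toy usage below are illustrative choices, not code from the paper:

```python
import numpy as np


class ScalarKalmanFilter:
    """Tracks a slowly varying latent mean from a stream of noisy scalar observations."""

    def __init__(self, init_mean=0.0, init_var=1.0, process_var=1e-2, obs_var=1.0):
        self.mean = init_mean            # current estimate of the latent reward mean
        self.var = init_var              # uncertainty (variance) of that estimate
        self.process_var = process_var   # Q: how fast the latent mean may drift per step
        self.obs_var = obs_var           # R: assumed noise variance of each observed return

    def predict(self):
        # Time update: the mean is assumed to persist; uncertainty grows by Q.
        self.var += self.process_var
        return self.mean

    def update(self, observed_return):
        # Measurement update: blend the prediction with the new observation.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (observed_return - self.mean)
        self.var *= (1.0 - gain)
        return self.mean


# Toy usage: smooth noisy returns drawn around a slowly drifting mean.
rng = np.random.default_rng(0)
kf = ScalarKalmanFilter()
for t in range(5):
    noisy_return = 10.0 + 0.5 * t + rng.normal(scale=3.0)
    kf.predict()
    filtered = kf.update(noisy_return)
    print(f"t={t}: observed={noisy_return:.2f}, filtered mean={filtered:.2f}")
```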
If this is right
- Convergence occurs faster on LunarLander and CartPole than with conventional reward normalization.
- Training variance across runs decreases when the Kalman estimate is used in place of normalized rewards.
- The estimate adapts automatically to changes in reward scale without manual retuning of normalization parameters.
- No modifications to existing policy or value network architectures are required.
- Computational overhead stays negligible because the filter operates in one dimension with simple recursive updates.
Where Pith is reading between the lines
- The same filtering idea could be applied to advantage estimates or temporal-difference errors to reduce variance in other parts of the learning process.
- Environments with deliberately drifting or non-stationary rewards would serve as stronger tests of the adaptation property.
- Optimal Kalman noise parameters might be learned jointly with the policy rather than set by hand.
- Similar recursive smoothing could address noisy signals in sequential decision problems outside reinforcement learning.
Load-bearing premise
Observed returns behave as noisy measurements around a slowly varying latent reward mean that the filter can track accurately enough to improve policy updates without adding bias or instability.
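In state-space terms this premise amounts to the generic scalar model assumed by a 1D Kalman filter; the equations below are the standard formulation, not quoted from the manuscript:

```latex
% Assumed generative model: a slowly drifting latent mean observed through noise
\mu_t = \mu_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q)   % latent reward mean: random walk with process-noise variance Q
r_t   = \mu_t + v_t,     \qquad v_t \sim \mathcal{N}(0, R)   % observed return: noisy measurement with variance R
```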
What would settle it
Running the Kalman-filtered reward method on LunarLander and finding no faster convergence or no reduction in training variance compared with standard normalization would refute the claimed advantage.
read the original abstract
We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high-variance returns and adapting to non-stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on LunarLander and CartPole demonstrate that Kalman-filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at https://github.com/Sumxiaa/Kalman_Normalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes K-Score, a method that integrates a 1D Kalman filter to recursively estimate the latent mean of observed rewards in policy-gradient RL. This is presented as a principled, low-overhead alternative to heuristic reward normalization that smooths variance and adapts to non-stationary reward scales without altering policy architectures. Experiments on LunarLander and CartPole are reported to show faster convergence and lower training variance than standard normalization.
Significance. If the central claim holds, the work supplies a simple, theoretically grounded filtering approach to reward scaling that could be broadly useful in non-stationary RL settings. Public code availability is a clear strength that supports reproducibility. The overall significance remains moderate because the empirical support is confined to two classic control tasks and the manuscript provides no quantitative metrics, run counts, or statistical tests.
major comments (2)
- [§2] §2 (online Kalman recursion and policy-gradient integration): The description states that the filter 'recursively estimates the latent reward mean' and that the estimate is 'fed into the policy gradient update.' It is not specified whether the measurement update at timestep t uses the reward r_t generated by the action a_t sampled from the current policy. This timing detail is load-bearing because any dependence between the baseline-like term and the sampled action risks introducing bias into the policy-gradient estimator, violating the usual independence requirement for unbiasedness.
- [Experiments] Experiments section: The claims that Kalman-filtered rewards 'significantly accelerate convergence and reduce training variance' are presented without any numerical results, number of independent runs, exact baseline implementations (e.g., running-mean normalization), or statistical tests. Because the central empirical claim rests on these comparisons, the absence of such details prevents assessment of whether the reported gains are robust or merely anecdotal.
minor comments (2)
- [Abstract] The abstract and title use 'K-Score' and 'Kalman-filtered rewards' interchangeably; a single consistent term would improve clarity.
- [§2] The Kalman filter equations would benefit from explicit recursion formulas (including the roles of process-noise and measurement-noise variances) rather than high-level description only.
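For reference, the explicit recursion requested in the second minor comment has the standard textbook form below for a scalar state, with process-noise variance Q and measurement-noise variance R; this is the generic filter, not equations taken from the manuscript:

```latex
% Predict (time update): carry the mean estimate forward and inflate its uncertainty by Q
\hat{\mu}_{t \mid t-1} = \hat{\mu}_{t-1}, \qquad P_{t \mid t-1} = P_{t-1} + Q

% Correct (measurement update): blend in the observed return r_t through the gain K_t
K_t = \frac{P_{t \mid t-1}}{P_{t \mid t-1} + R}, \qquad
\hat{\mu}_t = \hat{\mu}_{t \mid t-1} + K_t \bigl( r_t - \hat{\mu}_{t \mid t-1} \bigr), \qquad
P_t = (1 - K_t)\, P_{t \mid t-1}
```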
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important clarifications needed for the algorithmic description and strengthens the empirical evaluation. We address each major comment below and will incorporate revisions to improve the manuscript.
read point-by-point responses
-
Referee: [§2] §2 (online Kalman recursion and policy-gradient integration): The description states that the filter 'recursively estimates the latent reward mean' and that the estimate is 'fed into the policy gradient update.' It is not specified whether the measurement update at timestep t uses the reward r_t generated by the action a_t sampled from the current policy. This timing detail is load-bearing because any dependence between the baseline-like term and the sampled action risks introducing bias into the policy-gradient estimator, violating the usual independence requirement for unbiasedness.
Authors: We appreciate the referee pointing out this critical implementation detail regarding unbiasedness. In the K-Score approach, the 1D Kalman filter follows a standard predict-then-update recursion. At timestep t, the policy gradient for action a_t is computed using the current predicted reward mean estimate (prior to the measurement update). The reward r_t is observed only after the action is taken and the gradient step is performed; the measurement update then incorporates r_t to refine the estimate for future timesteps. This ordering ensures the baseline term remains independent of a_t. We will revise §2 to include explicit pseudocode and a step-by-step timing diagram clarifying this sequence (a minimal sketch of the ordering follows the point-by-point responses). revision: yes
-
Referee: [Experiments] Experiments section: The claims that Kalman-filtered rewards 'significantly accelerate convergence and reduce training variance' are presented without any numerical results, number of independent runs, exact baseline implementations (e.g., running-mean normalization), or statistical tests. Because the central empirical claim rests on these comparisons, the absence of such details prevents assessment of whether the reported gains are robust or merely anecdotal.
Authors: We agree that the experimental claims require quantitative support and statistical rigor to be fully convincing. In the revised manuscript, we will expand the Experiments section with tables reporting mean performance and standard deviation across a minimum of 10 independent random seeds for both LunarLander and CartPole. We will precisely describe the baseline normalization (including whether it uses a fixed-window running mean, exponential moving average, or other variant, along with all hyperparameters). Convergence will be quantified via metrics such as episodes to reach a target return threshold, and we will include statistical comparisons (e.g., Welch's t-test or Mann-Whitney U test with p-values) between K-Score and the baselines. All code for reproducing these results is already public, and we will add the exact run counts and seed values to the text. revision: yes
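To make the timing in the first response concrete, here is a self-contained toy sketch: a two-armed bandit with a softmax REINFORCE policy, where the baseline is read from the filter's prediction before the action is sampled and the measurement update happens only after the gradient step. The bandit, hyperparameters, and noise values are illustrative assumptions, not the authors' experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit with noisy rewards (stand-in for the RL environment).
true_means = np.array([1.0, 2.0])

# Softmax policy over the two arms, parameterised by logits theta.
theta = np.zeros(2)
lr = 0.05

# Scalar Kalman state: estimate of the latent reward mean and its variance,
# with illustrative process-noise (Q) and measurement-noise (R) variances.
mu, P = 0.0, 1.0
Q, R = 1e-3, 1.0

for t in range(2000):
    # 1. Predict BEFORE acting: the baseline cannot depend on a_t or r_t.
    P += Q
    baseline = mu

    # 2. Sample an action from the current policy and observe its reward.
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 1.0)

    # 3. REINFORCE step using the action-independent, Kalman-predicted baseline.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * (r - baseline) * grad_log_pi

    # 4. Only now fold r_t into the filter (measurement update).
    K = P / (P + R)
    mu += K * (r - mu)
    P *= 1.0 - K

print("final action probabilities:", np.round(probs, 3), "filtered reward mean:", round(mu, 2))
```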
Circularity Check
No circularity; direct application of standard Kalman filter recursion.
full rationale
The paper proposes integrating a 1D Kalman filter to recursively estimate the latent reward mean as an alternative to heuristic reward normalization. This is a straightforward substitution of an established filtering algorithm into the policy gradient update, with no claimed mathematical derivation that reduces to a self-referential definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. The central claim rests on empirical results from LunarLander and CartPole rather than any closed-form equivalence to its inputs. The online nature of the filter is presented as a feature for non-stationary environments, without any internal equations that force the reported gains by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- process noise variance
- measurement noise variance
axioms (1)
- domain assumption: The sequence of returns can be modeled as noisy observations of a latent constant or slowly varying mean.
Reference graph
Works this paper leans on
-
[1]
Asynchronous Methods for Deep Reinforcement Learning
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016.
-
[2]
Greedy Algorithms for Sparse Reinforcement Learning
Painter-Wakefield, C. and Parr, R. Greedy algorithms for sparse reinforcement learning. arXiv preprint arXiv:1206.6485, 2012.
-
[3]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. PMLR, 2015.
- [4]
-
[5]
Deep reinforcement learning and the deadly triad
Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.
-
[6]
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
Wang, H., Ma, C., Reid, I., and Yaqub, M. Kalman filter enhanced GRPO for reinforcement learning-based language model reasoning. arXiv preprint arXiv:2505.07527, 2025.
discussion (0)