Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning

Barbara Hammer; Fabian Hinder; Felix Störck

arxiv: 2606.11797 · v1 · pith:QMGXYHSXnew · submitted 2026-06-10 · 💻 cs.LG

Felix Störck, Fabian Hinder, Barbara Hammer

Pith reviewed 2026-06-27 10:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords non-stationary reinforcement learning · forgetting mechanisms · value decay · deep Q-networks · soft actor-critic · drift adaptation · uncertainty · value functions

The pith

Space-sampled Value Decay supplies an explicit forgetting rule that lets value-based deep RL adapt to drifting environments without any change signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Space-sampled Value Decay to address non-stationary reinforcement learning where environment parameters change over time but no drift information is supplied. It draws on rodent studies showing that forgetting can support adaptation under uncertainty and applies the idea to modify standard value-based methods. Experiments on altered DQN and SAC agents in non-stationary settings show measurable gains in some cases alongside clear limits on final returns. The central claim is that this sampling-based decay mechanism can mitigate drift effects through simple forgetting alone.

Core claim

Space-sampled Value Decay is presented as an explicit forgetting mechanism for value-based deep RL architectures. When added to DQN and SAC, the method produces positive effects on handling non-stationary environments while also revealing limitations in the returns that can be achieved.

What carries the argument

Space-sampled Value Decay, a mechanism that samples values across state space and applies decay to older estimates so that agents can drop outdated information without external change cues.
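
The abstract gives no equations, so the following is only a tabular sketch of the general idea described above, not the authors' formulation for DQN or SAC. The dictionary layout of `q_table`, the function names `decay_sampled_values` and `q_update`, uniform sampling of states, and multiplicative decay toward zero are all illustrative assumptions.

```python
import random


def decay_sampled_values(q_table, num_samples, decay_rate, rng):
    """Shrink the action values of randomly sampled states toward zero.

    q_table maps state -> list of action values (hypothetical layout).
    Returns the list of states whose values were decayed.
    """
    states = list(q_table)
    sampled = rng.sample(states, min(num_samples, len(states)))
    for state in sampled:
        q_table[state] = [(1.0 - decay_rate) * v for v in q_table[state]]
    return sampled


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Standard one-step Q-learning update, shown for context."""
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])
```

In a training loop one would call `q_update` on each observed transition and `decay_sampled_values` at some interval, so that states the agent no longer visits gradually lose their old estimates. How the paper carries this over to function approximators is not stated in the available text.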

If this is right

Value-based agents can adapt to environmental drift using only the uncertainty already present in their value estimates.
Simple decay rules can be inserted into existing DQN and SAC implementations without requiring task IDs or context variables.
Forgetting through space sampling reduces the impact of outdated rewards or transitions on current policy performance.
The same decay principle may apply to other value-based architectures beyond the two tested here.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the decay rule to policy-gradient methods could test whether the forgetting benefit is specific to value functions.
Environments with abrupt rather than gradual drift would provide a sharper test of whether sampling alone suffices.
If the decay rate can be learned from data, the method might reduce the need for manual tuning in new tasks.

Load-bearing premise

The tested non-stationary environments have drift patterns that value decay alone can offset even when the agent receives no information about when or how the environment changes.
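
To make the premise concrete, here is a toy environment in which the dynamics change without the agent being told. It is a hypothetical two-armed bandit, not one of the paper's benchmarks; the class name `DriftingBandit`, the abrupt swap, and the default probabilities are assumptions chosen for brevity.

```python
import random


class DriftingBandit:
    """Two-armed bandit whose reward probabilities swap at a hidden step.

    The agent only sees rewards; it never observes `swap_step` or `probs`.
    """

    def __init__(self, probs=(0.9, 0.1), swap_step=1000, seed=0):
        self.probs = list(probs)
        self.swap_step = swap_step
        self.t = 0
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Drift: the arms silently exchange their reward probabilities.
        if self.t == self.swap_step:
            self.probs.reverse()
        self.t += 1
        return 1.0 if self.rng.random() < self.probs[arm] else 0.0
```

An agent whose value estimates never decay keeps preferring the formerly good arm long after the swap, which is the failure mode a forgetting rule is meant to shorten.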

What would settle it

A controlled run on one of the paper's non-stationary benchmarks in which the Space-sampled Value Decay versions of DQN or SAC produce equal or lower returns than the unmodified baselines across multiple random seeds.
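
The check described above can be written down as a small helper. This is a sketch under assumptions: the function name `falsifies_claim` is hypothetical, each list holds one final return per random seed in matching seed order, and the criterion is a plain comparison of mean differences with no significance test.

```python
from statistics import mean


def falsifies_claim(baseline_returns, decay_returns):
    """Return True if the decay variant is equal or worse on average.

    Both arguments hold one final return per random seed, paired by seed.
    """
    if len(baseline_returns) != len(decay_returns) or not baseline_returns:
        raise ValueError("need the same non-zero number of seeds for both agents")
    # Positive difference means the decay variant did better on that seed.
    differences = [d - b for b, d in zip(baseline_returns, decay_returns)]
    return mean(differences) <= 0.0
```

With only a handful of seeds a mean comparison is noisy, so a real evaluation would likely add confidence intervals or a paired test on top of this.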

Figures

Figures reproduced from arXiv: 2606.11797 by Barbara Hammer, Fabian Hinder, Felix Störck.

**Figure 2.** Comparison of DQN approaches for different environments (left to right). Evaluation of LimitedDQN (no updates after a certain timestep), DQN (default settings in RL zoo (Raffin, 2020)) and DQN F (default DQN + SsVD). Extracted plot text: legend LimitedSAC, SAC, SAC F; panel titles InvertedPendulum-v5, MountainCarContinuous-v0, Ant-v5.

**Figure 3.** Same experimental setup as …

**Figure 4.** Ablation study: same experimental setup as …

Original abstract

Studies on rodents such as mice have shown the capabilities to adapt their behavior when dealing with changing parameters ("drift") of the environment even if no information about change is provided (uncertainty), a behavior that can be modeled by forgetting mechanisms. Non-stationary Reinforcement Learning (NSRL) deals with adapting state-of-the-art RL methods to deal with changing environments: these however usually require (partially) perfect information about the drift such as "task IDs" or "context". To mitigate the effects of drift, this work develops Space-sampled Value Decay as an explicit forgetting mechanism for value-based deep RL architectures as a simple yet effective approach. In particular we demonstrate and discuss positive effects but also limitations in achieved returns for modifications of Deep Q-networks (DQN) and Soft Actor-Critic (SAC) when evaluated on non-stationary environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper puts forward Space-sampled Value Decay as a forgetting trick for value-based RL in drifting environments without needing task IDs, but the abstract supplies no equations or numbers so the actual performance remains uncheckable.

The core claim is that Space-sampled Value Decay gives a simple forgetting mechanism for DQN and SAC variants that helps with non-stationary environments when no change information is supplied. That direction matches a real need in NSRL, where most existing fixes assume partial knowledge of the drift.

The work does two things cleanly. It draws an explicit link from rodent forgetting studies to value-function decay, and it avoids the common requirement for context or task IDs. The modifications to the two standard algorithms are described at a high level, which at least makes the idea easy to try.

The soft spots are straightforward. The abstract contains no equations for how the space sampling works, no description of the non-stationary test environments, and no quantitative results or baseline comparisons. Without those, it is impossible to tell whether the reported positive effects outweigh the mentioned limitations in returns or whether the method actually outperforms simpler decay baselines already in the literature. The rodent-to-RL translation is also left as an assumption rather than tested.

This is a subfield paper aimed at researchers already working on adaptive agents. A reader who needs concrete forgetting techniques for drifting MDPs could get something usable if the full experiments hold up, but the current text does not yet give enough to decide. The paper deserves a serious referee so the authors can supply the missing implementation details and results; the underlying problem is legitimate even if the current evidence is thin.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Space-sampled Value Decay as an explicit forgetting mechanism for value-based deep RL architectures (modifications of DQN and SAC) to handle non-stationary environments with environmental drift but no information about the change. Drawing inspiration from rodent studies on adaptation under uncertainty, the work claims to demonstrate positive effects alongside limitations in achieved returns when evaluated on non-stationary environments.

Significance. If the mechanism and results hold under scrutiny, the contribution would lie in offering a simple, context-free forgetting strategy for NSRL that does not rely on task IDs or explicit drift detection. This could broaden applicability of value-based methods in drifting settings and provide a biologically motivated baseline for comparison with more complex adaptation techniques.

major comments (1)

Abstract: The abstract provides no equations, experimental details, data, or results to verify whether the proposed mechanism actually supports the stated positive effects; assessment is impossible from the available text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their comments on our manuscript. We address the single major comment below and are prepared to revise the abstract accordingly.

Referee: Abstract: The abstract provides no equations, experimental details, data, or results to verify whether the proposed mechanism actually supports the stated positive effects; assessment is impossible from the available text.

Authors: We acknowledge that the provided abstract is concise and omits specific equations, experimental details, and quantitative results. It is standard for abstracts in the field to stay brief, with all technical details (including the SsVD formulation, DQN/SAC modifications, non-stationary environment setups, and results on returns) reserved for the main text. We agree this can make standalone assessment of the abstract difficult and will revise it in the next version to include a brief mention of the mechanism and the nature of the observed positive but limited effects.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description contain no equations, derivations, or first-principles claims that could reduce to inputs by construction. The work introduces Space-sampled Value Decay as an explicit mechanism and reports empirical effects on modified DQN/SAC without any fitted-parameter predictions presented as independent results or self-citation chains supporting uniqueness. The central contribution is therefore self-contained as a proposed architectural modification evaluated on non-stationary environments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no details on any free parameters, axioms, or invented entities are provided.

pith-pipeline@v0.9.1-grok · 5682 in / 1022 out tokens · 27482 ms · 2026-06-27T10:54:22.284439+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 8 canonical work pages · 4 internal anchors

[1] Abel, David; Jinnai, Yuu; Guo, Sophie Yue; Konidaris, George; Littman, Michael. Policy and Value Transfer in Lifelong Reinforcement Learning.

[2] Prevalence of Negative Transfer in Continual Reinforcement Learning: … International Conference on Learning Representations.

[3] Mice Exhibit Stochastic and Efficient Action Switching during Probabilistic Decision Making. Proceedings of the National Academy of Sciences.

[4] Chandak, Yash; Theocharous, Georgios; Shankar, Shiv; White, Martha; Mahadevan, Sridhar; Thomas, Philip S. (2020). Optimizing for the … arXiv:2005.08158.

[5] Gu, Shangding; Shi, Laixi; Wen, Muning; Jin, Ming; Mazumdar, Eric; Chi, Yuejie; Wierman, Adam; Spanos, Costas (2025). Robust … arXiv:2502.19652.

[6] Haarnoja, Tuomas; Zhou, Aurick; Abbeel, Pieter; Levine, Sergey (2018). Soft … arXiv:1801.01290.

[7] Soft Actor-Critic: … Proceedings of the 35th International Conference on Machine Learning.

[8] Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek; Abbeel, Pieter; Levine, Sergey (2019). Soft … arXiv:1812.05905.

[9] Hasselt, Hado. Double Q-Learning.

[10] Ito, Makoto; Doya, Kenji (2009). Validation of …

[11] Kato, Ayaka; Morita, Kenji (2016). Forgetting in …

[12] Keplinger, Nathaniel S.; Luo, Baiting; Bektas, Iliyas; Zhang, Yunuo; Wray, Kyle Hollins; Laszka, Aron; Dubey, Abhishek; Mukhopadhyay, Ayan (2025). arXiv:2501.09646.

[13] Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences.

[14] Liu, Yueyang; Kuang, Xu; Roy, Benjamin Van (2023). A … arXiv:2302.12202.

[15] Human-Level Control through Deep Reinforcement Learning. Nature.

[16] Raffin, Antonin (2020).

[17] Raffin, Antonin; Kober, Jens; Stulp, Freek. Smooth Exploration for Robotic Reinforcement Learning.

[18] Raffin, Antonin; Hill, Ashley; Gleave, Adam; Kanervisto, Anssi; Ernestus, Maximilian; Dormann, Noah (2021). Stable-Baselines3: …

[19] Rolnick, David; Ahuja, Arun; Schwarz, Jonathan; Lillicrap, Timothy; Wayne, Gregory. Experience Replay for Continual Learning.

[20] Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

[21] Silver, David; Sutton, Richard S. (2025). Welcome to the …

[22] Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv:2407.17032.

[23] Yershova, A.; LaValle, S.M. (2004). Deterministic Sampling Methods for Spheres and …
