arxiv: 2605.26357 · v2 · pith:B7ILAIKGnew · submitted 2026-05-25 · 💻 cs.LG

Balancing Plasticity and Stability with Fast and Slow Successor Features

Pith reviewed 2026-06-29 22:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords: continual learning, reinforcement learning, successor features, synaptic consolidation, stability-plasticity dilemma, non-stationary environments, multi-timescale learning, gradual drift

The pith

Stabilizing successor features with multi-timescale synaptic consolidation outperforms plasticity methods in gradually drifting environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the stability-plasticity dilemma for deep RL agents facing gradual rather than abrupt environmental change. It creates modified 3D Miniworld and MuJoCo settings with naturalistic continual drift to test consolidation versus resetting approaches. Findings indicate that synaptic consolidation applied to successor features produces better performance than consolidation of Q-values or parameter resets, and that consolidating across multiple timescales is most effective because the timescales together address different rates of environmental change. This points to stability mechanisms being particularly useful when non-stationarity unfolds slowly and continuously.

Core claim

In environments modified to feature gradual continual drift, applying neuro-inspired synaptic consolidation to successor features produces superior performance on continually changing tasks compared with methods that reset parameters or consolidate Q-values directly, with the largest gains arising when consolidation targets operate across multiple timescales that together capture complementary aspects of the drift.

What carries the argument

Successor Features stabilized via synaptic consolidation at multiple (fast and slow) timescales
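
For reference, the standard successor-feature decomposition (Barreto et al., 2017) that these consolidation targets build on can be written as below. This is the usual textbook form up to indexing conventions, not the paper's multi-timescale construction; the symbols $\phi$ (state-action features), $\gamma$ (discount), and $\psi^\pi$ (successor features under policy $\pi$) are assumed here, while $w$ is the reward weight vector named in the figure captions.

```latex
\begin{align}
  r(s, a) &= \phi(s, a)^\top w, \\
  \psi^\pi(s, a) &= \mathbb{E}^\pi\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a \right], \\
  Q^\pi(s, a) &= \psi^\pi(s, a)^\top w .
\end{align}
```

Under this decomposition, consolidating $\psi^\pi$ stabilizes the predictive part of the value function while $w$ remains free to track reward changes.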

If this is right

  • Stability-focused methods outperform plasticity-focused methods when environmental change occurs gradually rather than through discrete jumps.
  • Successor features serve as more effective consolidation targets than raw Q-values because they reduce interference across changing conditions.
  • Consolidation at multiple timescales captures complementary rates of environmental drift more effectively than any single timescale.
  • The performance advantage appears in both discrete navigation and continuous control domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-timescale consolidation principle could be tested on other predictive representations such as successor measures or latent dynamics models.
  • Robotics or autonomous driving systems that encounter slow seasonal or wear-induced drift might benefit from explicit fast-slow consolidation schedules.
  • Benchmarks that rely only on abrupt task boundaries may systematically underestimate the value of stability mechanisms.

Load-bearing premise

The artificial gradual drift added to the modified environments accurately models real-world non-stationarity without introducing implementation artifacts that favor the consolidation methods.

What would settle it

Running the same consolidation experiments but replacing the gradual drift with abrupt task switches and finding that multi-timescale successor-feature consolidation no longer outperforms single-timescale or Q-value consolidation.

Figures

Figures reproduced from arXiv: 2605.26357 by Blake Richards, Doina Precup, Raymond Chua.

Figure 1
Figure 1: Motivating stability-plasticity tradeoffs in naturalistic, continually non-stationary RL where the environment evolves gradually, rather than abruptly. To illustrate, we show (a) the Humanoid walking forward task and (b) an example of the noisy sine function used to generate smooth changes in its mass. (c) Average episode return plot and (d) Area under the curve (AUC) show that stability-preserving methods…
Figure 2
Figure 2: (a) Neuro-inspired synaptic consolidation model adapted from Benna & Fusi (2016). The visible variable, $u_1$, represents the synaptic efficacy $v$, while downstream hidden variables $u_2, u_3, \ldots$ interact bidirectionally across timescales, with beaker capacities $C_1 < C_2 < \cdots < C_K$ and tube widths representing flow strengths $g_{1,2} > g_{2,3} > \cdots > g_{K,K+1}$ that control the rate of interaction between the variables. To…
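
The beaker picture in this caption can be sketched as a chain of coupled variables. The snippet below is a minimal illustration, assuming capacities that double and tube widths that halve along the chain and a simple Euler step; those choices, and the names `make_chain` and `step`, are hypothetical and are not taken from the paper, which states that its consolidation variables are updated analytically.

```python
def make_chain(num_vars, c1=1.0, g12=0.25, leak=True):
    """Build capacities C_1 < C_2 < ... and tube widths g_{1,2} > g_{2,3} > ...

    Assumption: geometric spacing (doubling capacities, halving tubes).
    tubes[k] links variable k to variable k + 1; the last tube leaks to a
    reservoir held at zero unless leak=False.
    """
    capacities = [c1 * 2.0 ** k for k in range(num_vars)]
    tubes = [g12 * 2.0 ** (-k) for k in range(num_vars)]
    if not leak:
        tubes[-1] = 0.0
    return capacities, tubes


def step(u, capacities, tubes, drive=0.0, dt=1.0):
    """One Euler step of the chain; `drive` is the external update applied to u[0]."""
    num_vars = len(u)
    new_u = []
    for k in range(num_vars):
        # Flow from the faster neighbour (or the external drive for the visible variable).
        flow = drive if k == 0 else tubes[k - 1] * (u[k - 1] - u[k])
        # Flow from the slower neighbour (or the zero reservoir at the end of the chain).
        slower = u[k + 1] if k + 1 < num_vars else 0.0
        flow += tubes[k] * (slower - u[k])
        new_u.append(u[k] + dt * flow / capacities[k])
    return new_u


# Perturb the visible variable once, then let the chain relax with no leak.
caps, tubes = make_chain(4, leak=False)
u = [1.0, 0.0, 0.0, 0.0]
for _ in range(50):
    u = step(u, caps, tubes)
```

With the leak closed, the capacity-weighted sum of the variables is conserved, so a perturbation to the visible variable spreads gradually into the slower variables rather than disappearing.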
Figure 3
Figure 3: Results from Slippery Four Rooms with naturalistic, continually evolving slip dynamics that randomly replace actions. (a): Average return across two sequential tasks (Task 1 and 2), each repeated twice (Exposure 1 and 2). In DQN+P-last (yellow), plasticity injection is applied midway through training by randomly re-initializing the last layer’s parameters. (b): Steps to reach a predefined performance thresho…
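
The two summary metrics that recur in these captions, area under the return curve (AUC) and steps to reach a performance threshold, can be computed roughly as follows. This is a sketch under stated assumptions: trapezoidal integration and a moving-average threshold test are illustrative choices, and the function names and the `window` parameter are hypothetical rather than the paper's evaluation code.

```python
def auc(steps, returns):
    """Trapezoidal area under the return curve, with `steps` as the x-axis."""
    return sum(
        0.5 * (returns[i] + returns[i - 1]) * (steps[i] - steps[i - 1])
        for i in range(1, len(steps))
    )


def steps_to_threshold(steps, returns, threshold, window=3):
    """First step at which the mean of the last `window` returns reaches `threshold`.

    Returns None if the threshold is never reached (lower is better otherwise).
    """
    for i in range(window - 1, len(returns)):
        recent = returns[i - window + 1 : i + 1]
        if sum(recent) / window >= threshold:
            return steps[i]
    return None
```

Averaging over a short window before testing the threshold is one way to avoid counting a single lucky episode as having learned a good policy.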
Figure 4
Figure 4: Slippery Four Rooms.
For base models, we use Double Deep Q-Network (DQN) (Van Hasselt et al., 2016) for the Slippery Four Rooms environment and the Deterministic Policy Gradient algorithm (Silver et al., 2014) with twin critics (TD3; Fujimoto et al., 2018) for MuJoCo. For SFs, we use Simple SFs (Chua et al., 2024), which can be learned without auxiliary losses. We selected these models due to their flexibilit…
Figure 6
Figure 6: Quantification of mass changes for the Humanoid embodiment. We consider three levels of mass dynamics variation: mild (25%), moderate (50%), and severe (100%), corresponding to the maximum change allowed before the physical simulation becomes unstable. Across these settings, plasticity-preserving methods (CBP, P-last) are less effective than approaches incorporating synaptic consolidation (SC). Under moder…
Figure 7
Figure 7: Analysis of timescales in MuJoCo using our model, SF + SC. More consolidation variables (6–9) improve learning efficiency, highlighting the benefit of slower timescales. Zero variables correspond to the Simple SF agent. See Appendix M for results in the Slippery Four Rooms environment.
5.4. Robustness to Non-Periodic and Stochastic Drift
To test robustness beyond periodic drift, we replace the noisy sine m…
Figure 8
Figure 8: Cross-Attention analysis of individual consolidations, replacing memory recall via backflow. (a): Implementation design. (b-c): Attention probabilities over consolidation variables, where higher probability indicates greater contribution to learning. Full results are provided in Appendix O.
… serve as keys and values. The softmax probabilities from the cross-attention mechanism (Figure 8b and c) showed that …
Figure 9
Figure 9: Capacity analysis using Humanoid. The x-axis shows the number of parameters, while the y-axis shows performance measured by area under the curve (AUC). Increasing the parameter count of TD3 and its variants did not consistently improve performance compared to SF + SC (star), suggesting the contribution of consolidating SFs beyond network capacity scaling alone.
Since each consolidation variable introduces an…
Figure 10
Figure 10: (a) The slippery variant of the 3D Four Rooms environment. The agent alternates between two tasks: in Task 1, reaching the green box produces +1 reward and the yellow box -1; in Task 2, the reward assignment is reversed. At each step, the agent’s chosen action may be randomly replaced with a probability sampled from the noisy sine function shown in B. The agent receives only egocentric pixel observations.…
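
A drift schedule of the kind this caption describes might look like the sketch below: a slowly oscillating slip probability with additive noise, used to decide whether the chosen action is replaced by a random one. The 0.45 ceiling is taken from the caption of Figure 41; the period, the Gaussian noise level, the uniform replacement rule, and the function names are assumptions made for illustration, not the paper's implementation.

```python
import math
import random


def noisy_sine_slip(step, period=100_000, max_slip=0.45, noise_std=0.02, rng=random):
    """Slip probability at a given environment step.

    Assumptions: a sine wave rescaled to [0, max_slip] plus Gaussian noise,
    clipped back into the valid range.
    """
    base = 0.5 * max_slip * (1.0 + math.sin(2.0 * math.pi * step / period))
    noisy = base + rng.gauss(0.0, noise_std)
    return min(max(noisy, 0.0), max_slip)


def maybe_slip(action, num_actions, slip_prob, rng=random):
    """With probability slip_prob, replace the chosen action with a uniformly random one."""
    if rng.random() < slip_prob:
        return rng.randrange(num_actions)
    return action


# Example: sample the executed action at one step of a seeded run.
rng = random.Random(0)
slip = noisy_sine_slip(step=25_000, rng=rng)
executed = maybe_slip(action=2, num_actions=4, slip_prob=slip, rng=rng)
```

Because the probability changes by a small amount from one step to the next, the agent never sees a sharp task boundary, which is the property the gradual-drift setting is meant to capture.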
Figure 11
Figure 11: MuJoCo suite with continuous periodic mass changes during training and evaluation for (a) Humanoid, (b) Walker, (c) Quadruped, (d) Half-cheetah. A periodic noisy sine wave generates continuously varying mass values, which are used to stochastically scale the agent’s mass during training and evaluation.
Figure 12
Figure 12: (a) Partial view over the first 1 million environment steps and (b) complete view over the full 10 million environment training steps of the noisy non-periodic sine wave used to generate continuously varying slip probabilities that stochastically replace the agent’s intended actions.
E.4. Continuous Control in MuJoCo (Non-Periodic)
[Panel titles: Humanoid Task: Walk Forward; Quadruped Task: Run Forward; (c) Walker Task: R…]
Figure 14
Figure 14: (a) Partial view over the first 1 million environment steps and (b) complete view over the full 10 million training steps of the Ornstein–Uhlenbeck (OU) process used to generate varying slip probabilities that stochastically replace the agent’s intended actions.
E.6. Continuous Control in MuJoCo (Ornstein–Uhlenbeck processes)
[Panel titles: Humanoid Task: Walk Forward; Quadruped Task: Run Forward; (c) Walker Task: Run For…]
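
For readers unfamiliar with the OU process named in this caption, a minimal Euler-Maruyama discretization is sketched below. All parameter values (`theta`, `mu`, `sigma`, the clipping range) and the function name `ou_path` are illustrative assumptions; the page does not report the settings the authors used, and clipping to keep the value a valid probability is a choice made here.

```python
import math
import random


def ou_path(num_steps, theta=0.01, mu=0.2, sigma=0.02, x0=0.2,
            dt=1.0, low=0.0, high=0.45, seed=0):
    """Simulate a clipped Ornstein-Uhlenbeck path.

    Update rule: x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1),
    followed by clipping to [low, high] (an assumption, so x stays a probability).
    """
    rng = random.Random(seed)
    x = x0
    path = []
    for _ in range(num_steps):
        x += theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = min(max(x, low), high)
        path.append(x)
    return path


# Example: a short slip-probability trajectory.
slip_probs = ou_path(1_000)
```

Unlike the sine schedule, this process has no period: it wanders randomly while being pulled back toward `mu`, which is why it serves as a test of robustness to stochastic drift.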
Figure 16
Figure 16: Slippery Four Rooms environment. We extended the environment used in Simple Successor Features (Chua et al., 2024), which was built upon the original 3D Miniworld environment (Chevalier-Boisvert et al., 2023). In the slippery variant of the Four Rooms environment, we mimic wet or icy conditions in all four rooms, rather than just two rooms (top right and bottom left). The key difference is, unlike in…
Figure 17
Figure 17: Plasticity–stability analysis in the MuJoCo suite under continuous mass changes induced by a periodic noisy sine function during training and evaluation. We compare a baseline TD3 agent with three variants: (i) Continual Backprop (CBP), which selectively resets least-active weights; (ii) plasticity injection by resetting the weights in the last layer (P-last); and (iii) synaptic consolidation (SC). Across…
Figure 18
Figure 18: Plasticity–stability analysis in the Slippery Four Rooms environment using a non-periodic noisy sinusoidal function. The agent undergoes two exposures; after each learning phase, the reward mapping is reversed. (a) Average return per episode. (b) Learning efficiency (steps to reach a good policy; lower is better). For the plasticity-injection agent, plasticity was injected once at 10 million environment ste…
Figure 19
Figure 19: Plasticity–stability analysis in the MuJoCo suite under continuous mass changes induced by a non-periodic noisy sine function during training and evaluation. We compare a baseline TD3 agent with three variants: (i) Continual Backprop (CBP), which selectively resets least-active weights; (ii) plasticity injection by resetting the weights in the last layer (P-last); and (iii) synaptic consolidation (SC). Ac…
Figure 20
Figure 20: Plasticity–stability analysis in the Slippery Four Rooms environment using OU processes. The agent undergoes two exposures; after each learning phase, the reward mapping is reversed. (a) Average return per episode. (b) Learning efficiency (steps to reach a good policy; lower is better). For the plasticity-injection agent, plasticity was injected once at 10 million environment steps (end of Exposure 1). In…
Figure 21
Figure 21: Plasticity–stability analysis in the MuJoCo suite under continuous mass changes induced by Ornstein–Uhlenbeck processes during training and evaluation. We compare a baseline TD3 agent with three variants: (i) Continual Backprop (CBP), which selectively resets least-active weights; (ii) plasticity injection by resetting the weights in the last layer (P-last); and (iii) synaptic consolidation (SC). Across e…
Figure 22
Figure 22: Schematic of synaptic consolidation applied to Q-values and to Successor Features (SFs). (a): Kaplanis et al. (2018) showed that adapting the synaptic consolidation mechanism of Benna & Fusi (2016) to Q-values improves robustness in continual RL. (b): Here, we extend this approach to predictive, generalizable representations using Simple Successor Features (Chua et al., 2024). Consolidated variables (e.…
Figure 23
Figure 23: Comparison of Q-values and Successor Features (SFs), with synaptic consolidation (SC) or elastic weight consolidation (EWC), in the 3D Slippery Four Rooms environment during training and evaluation. Applying SC to Q-values (green) and SFs (purple) offers higher learning efficiency than their EWC counterparts, requiring fewer steps to learn a good policy. This demonstrates that SC is more effective than EW…
Figure 24
Figure 24: Comparison of Q-values and Successor Features (SFs), with and without synaptic consolidation, on the MuJoCo suite under continuous mass changes during training and evaluation. Interestingly, unlike Q-values, applying synaptic consolidation to SFs (purple) yields consistently higher learning efficiency.
Figure 25
Figure 25: Comparison of consolidating the parameters of Q-values and SFs using Synaptic Consolidation (SC) in the 3D Slippery Four Rooms environment. (left): Average episode return. (right): Number of training steps needed to reach a pre-determined good policy; fewer steps is better. Applying SC to the SFs (purple) yields better learning performance overall.
L.2. MuJoCo suite with periodic mass changes
Figure 26
Figure 26: Comparison of consolidating the parameters of Q-values and SFs using Synaptic Consolidation in the MuJoCo suite. Interestingly, when compared to TD3 (blue), SFs (orange) learn well in Half-Cheetah and Walker but not in Quadruped and Humanoid. This is probably due to the higher complexity of Quadruped and Humanoid, which have larger state and action spaces. Overall, applying SC to the SFs (purple) yields bett…
Figure 27
Figure 27: Comparison of consolidating the parameters of Q-values and SFs using Synaptic Consolidation in the MuJoCo suite when embodiments undergo non-periodic mass changes. Interestingly, when compared to TD3 (blue), SFs (orange) learn well in Half-Cheetah and Walker but not in Quadruped and Humanoid. This is probably due to the higher complexity of Quadruped and Humanoid, which have larger state and action spaces. U…
Figure 28
Figure 28: Comparison of consolidating the parameters of Q-values and SFs using Synaptic Consolidation in the MuJoCo suite when embodiments undergo Ornstein–Uhlenbeck mass changes. In this setting, applying SC to the SFs (purple) yields better learning performance only relative to applying SC to Q-values.
Figure 29
Figure 29: Analysis of fast and slow timescale variables in the 3D Slippery Four Rooms environment during training and evaluation. Using synaptic consolidation clearly leads to better learning efficiency, but there is no clear advantage between six and nine consolidation variables.
M.2. MuJoCo suite Results
Figure 30
Figure 30: Analysis of fast and slow timescale variables on the MuJoCo suite under continuous mass changes during training and evaluation. Using more consolidation variables (six, eight or nine) yields consistently higher learning efficiency, highlighting the importance of slower-timescale variables.
Figure 31
Figure 31: Using cross-attention to recall information from the SF consolidation modules. (a): A high-level schematic of how the cross-attention mechanism is used. (b): The computations for the cross-attention mechanism. We used the reward weight vector $w$ as the query and the SF consolidation variables, except the most plastic one, as keys and values ($SF_{u_2}, SF_{u_3}, \ldots, SF_{u_K}$). Because these SFs consolidation variables…
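
The recall step in this caption can be illustrated with a single-query scaled dot-product attention. The sketch below assumes no learned projection matrices and a standard 1/sqrt(d) scaling; both are simplifications chosen here, and the name `attend` is hypothetical. Only the roles of the inputs (reward weights as query, slower consolidation variables as keys and values) come from the caption.

```python
import math


def attend(w, consolidated_sfs):
    """Attend over consolidated SF vectors using the reward weights as the query.

    w: list of floats (reward weight vector, the query).
    consolidated_sfs: list of vectors, one per consolidation variable u_2..u_K,
        used as both keys and values (assumption: no learned projections).
    Returns (attention probabilities, recalled SF vector).
    """
    scale = math.sqrt(len(w))
    scores = [sum(wi * ki for wi, ki in zip(w, key)) / scale
              for key in consolidated_sfs]
    # Numerically stable softmax over the consolidation variables.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Recalled vector: probability-weighted sum of the values.
    recalled = [sum(p * value[d] for p, value in zip(probs, consolidated_sfs))
                for d in range(len(w))]
    return probs, recalled


# Example with three consolidation variables in a 2-dimensional feature space.
probs, recalled = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
```

The attention probabilities returned here correspond to the quantity plotted in the cross-attention figures: a higher probability means that consolidation variable contributes more to the recalled representation.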
Figure 32
Figure 32: Analysis of all consolidated variables using Cross-Attention during training in the 3D Slippery Four Rooms environment. The cross-attention probabilities indicate that fast and slow timescale variables were attended to similarly, suggesting nearly equal contribution. This may be due to the sparse reward structure in the 3D Slippery Four Rooms environment, which affects how discriminative the SFs are given t…
Figure 33
Figure 33: Analysis of all consolidated variables using cross-attention in the MuJoCo suite under continuous mass changes. Memory recall was performed solely through the cross-attention mechanism, rather than by waiting for information to propagate from slower to faster timescale variables. Unsurprisingly, faster timescale variables were attended to more than slower ones. Notably, Half-Cheetah and Walker benefited f…
Figure 34
Figure 34: Learning curves in the MuJoCo suite under continuous mass changes with cross-attention over consolidated variables. Faster timescale variables were generally attended to more strongly than slower ones.
Figure 35
Figure 35: Simple SFs with synaptic consolidation architecture. Simple SFs were adapted from Chua et al. (2024), with TD3 (Lillicrap et al., 2015) as base model. The synaptic consolidation variables are updated analytically (see Section 4 for more details on the consolidation variables). We swept the task learning rate of the reward weight vector across the values $\{10^{-5}, 10^{-6}, \ldots, 10^{-10}\}$ when optimizing th…
Figure 36
Figure 36: Comparison of training throughput (FPS) for all models in the Slippery Four Rooms environment. Higher FPS reflects more efficient computation.
Figure 37
Figure 37: Comparison of training throughput (FPS) for different numbers of consolidation variables for the SFs in the Slippery Four Rooms environment. Higher FPS reflects more efficient computation.
Figure 38
Figure 38: Comparison of training throughput (FPS) for all models with the Humanoid embodiment in the MuJoCo environment. Higher FPS reflects more efficient computation.
Figure 39
Figure 39: Comparison of training throughput (FPS) for different numbers of consolidation variables for the SFs with the Humanoid embodiment in the MuJoCo environment. Higher FPS reflects more efficient computation.
Figure 40
Figure 40: Comparison of PPO (teal) with TD3, SF, and their variants with plasticity preservation or stability enhancement mechanisms under continuous mass changes. (Left) Average episode return over training. (Middle) Area under the curve (AUC) of the return, summarizing overall performance. (Right) Total number of environment samples used during training. While PPO leverages parallelized data collection and uses m…
Figure 41
Figure 41: Quantification of slippery dynamics in the 3D Four Rooms environment. (a) Average episode return, (b) minimum number of environment steps required to learn a successful policy, and (c) area under the curve (AUC) of the episode returns. We consider three levels of slippery probability variation: mild (25%), moderate (50%), and severe (100%), where the maximum corresponds to a 0.45 probability that the sele…
Figure 42
Figure 42: Quantification of mass changes for the Humanoid embodiment. We consider three levels of mass dynamics variation: mild (25%), moderate (50%), and severe (100%), corresponding to the maximum change allowed before the physical simulation becomes unstable. Across these settings, plasticity-preserving methods (CBP, P-last) are less effective than approaches incorporating synaptic consolidation (SC). Under mode…
Figure 43
Figure 43: Quantification of mass changes for the Quadruped embodiment. We consider three levels of mass dynamics variation: mild (25%), moderate (50%), and severe (100%), corresponding to the maximum change allowed before the physical simulation becomes unstable. Across these settings, plasticity-preserving methods (CBP, P-last) are less effective than approaches incorporating synaptic consolidation (SC). Under sev…
Figure 44
Figure 44: Quantification of mass changes for the Half-Cheetah embodiment. We consider three levels of mass dynamics variation: mild (25%), moderate (50%), and severe (100%), corresponding to the maximum change allowed before the physical simulation becomes unstable. Across these settings, plasticity-preserving methods (CBP, P-last) are less effective than approaches incorporating synaptic consolidation (SC). Under …
Figure 45
Figure 45: Quantification of mass changes for the Walker embodiment. We consider three levels of mass dynamics variation: mild (25%), moderate (50%), and severe (100%), corresponding to the maximum change allowed before the physical simulation becomes unstable. Across these settings, plasticity-preserving methods (CBP, P-last) are less effective than approaches incorporating synaptic consolidation (SC). Under severe…
Original abstract

A hallmark of intelligence is the ability to adapt in non-stationary environments, yet deep Reinforcement Learning (RL) agents often struggle in such settings. Prior studies introduce non-stationarity through abrupt shifts in features or dynamics, whereas real-world environments often evolve gradually through continual drift. This distinction has important implications for the "stability-plasticity dilemma" in RL, as abrupt task changes may demand more plasticity than naturalistic settings. To address this, we modify existing 3D Miniworld and MuJoCo environments to incorporate naturalistic, continual non-stationarity, and use them to examine how stability and adaptation affect performance under continuous environmental change. We find that methods favoring stability, such as synaptic consolidation, outperform approaches focused on plasticity, such as parameter resetting. Motivated by this result, and prior evidence that Successor Features (SFs) reduce interference, we investigate whether SFs are better consolidation targets than Q-values. Across both environments, applying neuro-inspired synaptic consolidation to SFs yields superior performance in continually changing settings. Moreover, consolidation is most effective when SFs are stabilized across multiple timescales, which capture complementary aspects of gradual environmental change. Together, these results suggest that stability is more critical in continual learning when changes are gradual, and that multi-timescale consolidation of predictive representations is an effective approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript modifies 3D Miniworld and MuJoCo environments to include gradual, continual non-stationarity and compares synaptic consolidation applied to successor features (SFs) against plasticity-focused baselines such as parameter resetting. It reports that consolidation on SFs, particularly when performed across multiple timescales, yields superior performance under these drifting conditions and concludes that stability is more critical than plasticity for gradual environmental change.

Significance. If the empirical results are robust, the work supplies concrete evidence that predictive representations such as SFs can serve as effective consolidation targets and that multi-timescale stabilization captures complementary aspects of slow environmental drift. This would strengthen the case for stability-oriented mechanisms in continual RL and motivate further investigation of timescale-separated representations.

major comments (2)
  1. [Environment modification] Environment modification section: the functional form, rate, and scope of the introduced drift (whether applied to rewards, transitions, or visual features; linear, sinusoidal, or stochastic) are not specified. Because the central claim that multi-timescale SF consolidation outperforms plasticity baselines rests on these environments faithfully instantiating naturalistic gradual change, the absence of this detail leaves open the possibility that observed gains are artifacts of the particular drift implementation.
  2. [Experimental results] Experimental results: the abstract asserts empirical superiority of consolidation on SFs, yet the manuscript supplies no information on the number of independent runs, statistical tests performed, or controls for confounding implementation choices. Without these, the reported performance differences cannot be assessed as reliable support for the stability-plasticity claim.
minor comments (1)
  1. [Method] Notation for the fast and slow successor-feature components is introduced without an explicit equation relating them to the standard SF definition; adding this would clarify the multi-timescale construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional detail will strengthen the manuscript's clarity and support for its claims. We address each point below and will revise accordingly.

Point-by-point responses
  1. Referee: [Environment modification] Environment modification section: the functional form, rate, and scope of the introduced drift (whether applied to rewards, transitions, or visual features; linear, sinusoidal, or stochastic) are not specified. Because the central claim that multi-timescale SF consolidation outperforms plasticity baselines rests on these environments faithfully instantiating naturalistic gradual change, the absence of this detail leaves open the possibility that observed gains are artifacts of the particular drift implementation.

    Authors: We agree that precise specification of the drift is essential for reproducibility and to substantiate that the environments capture gradual naturalistic change. In the revised manuscript we will expand the Environment modification section with the exact functional forms, rates, scopes (rewards, transitions, visual features), and any stochastic components used in both the 3D Miniworld and MuJoCo setups. revision: yes

  2. Referee: [Experimental results] Experimental results: the abstract asserts empirical superiority of consolidation on SFs, yet the manuscript supplies no information on the number of independent runs, statistical tests performed, or controls for confounding implementation choices. Without these, the reported performance differences cannot be assessed as reliable support for the stability-plasticity claim.

    Authors: We acknowledge that reporting the number of independent runs, statistical tests, and controls is required to evaluate reliability. The revised manuscript will include these details (number of random seeds, statistical tests with p-values, and controls for implementation choices) in the Experimental results section and figure captions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; paper is purely empirical.

Full rationale

The manuscript contains no derivations, equations, or fitted parameters presented as predictions. All claims rest on experimental comparisons of consolidation methods versus baselines in modified environments. These results are externally falsifiable through replication and do not reduce to self-definition, load-bearing self-citation, or renaming of known results. The central premise (superiority of multi-timescale SF consolidation under gradual drift) is supported by performance metrics rather than by construction from its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

Reference graph

Works this paper leans on

55 extracted references · 17 canonical work pages · 9 internal anchors

  1. [1]

    Abbas, Z., Zhao, R., Modayil, J., White, A., and Machado, M. C. Loss of plasticity in continual deep reinforcement learning. In Conference on lifelong learning agents, pp.\ 620--636. PMLR, 2023

  2. [2]

    P., and Singh, S

    Abel, D., Barreto, A., Van Roy, B., Precup, D., van Hasselt, H. P., and Singh, S. A definition of continual reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 50377--50407, 2023

  3. [3]

    and Precup, D

    Anand, N. and Precup, D. Prediction and control in continual reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 63779--63817, 2023

  4. [4]

    J., Schaul, T., van Hasselt, H

    Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. Advances in neural information processing systems, 30, 2017

  5. [5]

    G., Naddaf, Y., Veness, J., and Bowling, M

    Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of artificial intelligence research, 47: 0 253--279, 2013

  6. [6]

    K., Kolouri, S., and Soltoggio, A

    Ben-Iwhiwhu, E., Nath, S., Pilly, P. K., Kolouri, S., and Soltoggio, A. Lifelong reinforcement learning with modulating masks. arXiv preprint arXiv:2212.11110, 2022

  7. [7]

    Benna, M. K. and Fusi, S. Computational principles of synaptic memory consolidation. Nature neuroscience, 19 0 (12): 0 1697--1706, 2016

  8. [8]

    Experiment tracking with weights and biases, 2020

    Biewald, L. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com

  9. [9]

    Universal Successor Features Approximators

    Borsa, D., Barreto, A., Quan, J., Mankowitz, D., Munos, R., Van Hasselt, H., Silver, D., and Schaul, T. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018

  10. [10]

    J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander P las, J., Wanderman- M ilne, S., and Zhang, Q

    Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander P las, J., Wanderman- M ilne, S., and Zhang, Q. JAX : composable transformations of P ython+ N um P y programs, 2018. URL http://github.com/google/jax

  11. [11]

    Task-agnostic continual reinforcement learning: Gaining insights and overcoming challenges

    Caccia, M., Mueller, J., Kim, T., Charlin, L., and Fakoor, R. Task-agnostic continual reinforcement learning: Gaining insights and overcoming challenges. In Conference on Lifelong Learning Agents, pp. 89–119. PMLR, 2023

  12. [12]

    Chevalier-Boisvert, M., Dai, B., Towers, M., de Lazcano, R., Willems, L., Lahlou, S., Pal, S., Castro, P. S., and Terry, J. Minigrid & miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR, abs/2306.13831, 2023

  13. [13]

    Chua, R., Ghosh, A., Kaplanis, C., Richards, B. A., and Precup, D. Learning successor features the simple way. Advances in Neural Information Processing Systems, 37:49957–50030, 2024

  14. [14]

    Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  16. [16]

    French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999

  17. [17]

    Addressing function approximation error in actor-critic methods

    Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods, 2018

  18. [18]

    Jraph: A library for graph neural networks in JAX

    Godwin*, J., Keck*, T., Battaglia, P., Bapst, V., Kipf, T., Li, Y., Stachenfeld, K., Veličković, P., and Sanchez-Gonzalez, A. Jraph: A library for graph neural networks in JAX, 2020. URL http://github.com/deepmind/jraph

  19. [19]

    Flax: A neural network library and ecosystem for JAX

    Heek, J., Levskaya, A., Oliver, A., Ritter, M., Rondepierre, B., Steiner, A., and van Zee, M. Flax: A neural network library and ecosystem for JAX, 2024. URL http://github.com/google/flax

  20. [20]

    Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55

  21. [21]

    Continual reinforcement learning with complex synapses

    Kaplanis, C., Shanahan, M., and Clopath, C. Continual reinforcement learning with complex synapses. In International Conference on Machine Learning, pp. 2497–2506. PMLR, 2018

  22. [22]

    Policy Consolidation for Continual Reinforcement Learning

    Kaplanis, C., Shanahan, M., and Clopath, C. Policy consolidation for continual reinforcement learning. arXiv preprint arXiv:1902.00255, 2019

  23. [23]

    Kaplanis, C., Clopath, C., and Shanahan, M. Continual reinforcement learning with multi-timescale replay (2020). DOI: https://doi.org/10.48550/arXiv, 2020

  24. [24]

    Towards continual reinforcement learning: A review and perspectives

    Khetarpal, K., Riemer, M., Rish, I., and Precup, D. Towards continual reinforcement learning: A review and perspectives. Journal of Artificial Intelligence Research, 75:1401–1476, 2022

  25. [25]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  26. [26]

    Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  27. [27]

    Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

  28. [28]

    Jupyter notebooks -- a publishing format for reproducible computational workflows

    Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., and Willing, C. Jupyter notebooks -- a publishing format for reproducible computational workflows. In Loizides, F. and Schmidt, B. (eds.), Positioning and Power in Academic Publishing:...

  29. [29]

    Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks

    Lee, H., Cho, H., Kim, H., Kim, D., Min, D., Choo, J., and Lyle, C. Slow and steady wins the race: Maintaining plasticity with hare and tortoise networks. arXiv preprint arXiv:2406.02596, 2024

  30. [30]

    Continuous control with deep reinforcement learning

    Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  31. [31]

    Disentangling the causes of plasticity loss in neural networks

    Lyle, C., Zheng, Z., Khetarpal, K., van Hasselt, H., Pascanu, R., Martens, J., and Dabney, W. Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762, 2024

  32. [32]

    McClelland, J. L., McNaughton, B. L., and O'Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995

  33. [33]

    McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989

  34. [34]

    The primacy bias in deep reinforcement learning

    Nikishin, E., Schwarzer, M., D’Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pp. 16828–16847. PMLR, 2022

  35. [35]

    Deep reinforcement learning with plasticity injection

    Nikishin, E., Oh, J., Ostrovski, G., Lyle, C., Pascanu, R., Dabney, W., and Barreto, A. Deep reinforcement learning with plasticity injection. Advances in Neural Information Processing Systems, 36:37142–37159, 2023

  36. [36]

    PyTorch: An imperative style, high-performance deep learning library

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019

  37. [37]

    Self-activating neural ensembles for continual reinforcement learning

    Powers, S., Xing, E., and Gupta, A. Self-activating neural ensembles for continual reinforcement learning. In Conference on Lifelong Learning Agents, pp. 683–704. PMLR, 2022

  38. [38]

    Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference

    Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., and Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018

  39. [39]

    Experience replay for continual learning

    Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., and Wayne, G. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019

  40. [40]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  41. [41]

    Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pp. 4528–4537. PMLR, 2018

  42. [42]

    Silver, D. and Sutton, R. S. Welcome to the era of experience. Google AI, 1, 2025

  43. [43]

    Deterministic policy gradient algorithms

    Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395. PMLR, 2014

  44. [44]

    Sokar, G., Agarwal, R., Castro, P. S., and Evci, U. The dormant neuron phenomenon in deep reinforcement learning. In International Conference on Machine Learning, pp. 32145–32168. PMLR, 2023

  45. [45]

    Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018

  46. [46]

    MuJoCo: A physics engine for model-based control

    Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012

  47. [47]

    dm_control: Software and tasks for continuous control

    Tunyasuvunakool, S., Muldal, A., Doron, Y., Liu, S., Bohez, S., Merel, J., Erez, T., Lillicrap, T., Heess, N., and Tassa, Y. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020

  48. [48]

    Deep reinforcement learning with double q-learning

    Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, 2016

  49. [49]

    Van Rossum, G. and Drake, F. L. Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 2009. ISBN 1441412697

  50. [50]

    Waskom, M. L. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi:10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021

  51. [51]

    Deep reinforcement learning amidst lifelong non-stationarity

    Xie, A., Harrison, J., and Finn, C. Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv:2006.10701, 2020

  52. [52]

    Hydra - a framework for elegantly configuring complex applications

    Yadan, O. Hydra - a framework for elegantly configuring complex applications. GitHub, 2019. URL https://github.com/facebookresearch/hydra

  53. [53]

    Mastering visual continuous control: Improved data-augmented reinforcement learning

    Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021

  54. [54]

    Continual learning through synaptic intelligence

    Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995. PMLR, 2017
