Learning to Forget: Continual Learning with Adaptive Weight Decay
Pith reviewed 2026-05-07 10:32 UTC · model grok-4.3
The pith
Per-parameter adaptive weight decay enables selective forgetting to balance new learning and retention in continual learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.
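To make the mechanism concrete, the sketch below shows one way a per-parameter adaptive decay rule could look in the online linear setting, using an IDBD-style first-order meta-gradient on the decay rates. The sigmoid parameterization, the step sizes, and the trace update are assumptions for illustration; the paper's exact FADE equations are not reproduced on this page.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_decay_linear(X, Y, alpha=0.01, mu=0.001, beta0=-4.0):
    """Online linear regression with per-parameter decay rates adapted by a
    first-order, IDBD-style meta-gradient. Illustrative sketch only; not the
    paper's exact FADE update."""
    n_steps, dim = X.shape
    w = np.zeros(dim)            # weights
    beta = np.full(dim, beta0)   # meta-parameters; decay rate lambda_i = sigmoid(beta_i)
    h = np.zeros(dim)            # trace approximating d w_i / d beta_i
    losses = []
    for t in range(n_steps):
        x, y = X[t], Y[t]
        delta = y - w @ x                      # prediction error
        losses.append(0.5 * delta ** 2)
        # meta step: dL/dbeta_i ~= dL/dw_i * dw_i/dbeta_i = (-delta * x_i) * h_i
        beta += mu * delta * x * h
        lam = sigmoid(beta)                    # per-parameter decay rates in (0, 1)
        dlam = lam * (1.0 - lam)               # d lambda_i / d beta_i
        # base update: decay each weight toward zero, then take a gradient step
        w_next = (1.0 - lam) * w + alpha * delta * x
        # first-order trace update (ignores cross-parameter interactions)
        h = (1.0 - lam - alpha * x * x) * h - dlam * w
        w = w_next
    return w, sigmoid(beta), np.array(losses)
```

On a stream where one target weight drifts while another stays fixed, a rule of this shape would be expected to raise the decay rate of the drifting parameter and lower it for the stable one, which is the qualitative behavior the core claim describes.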
What carries the argument
Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent to produce selective forgetting.
If this is right
- FADE automatically discovers distinct decay rates for different parameters.
- It complements step-size adaptation in online settings.
- It consistently improves over fixed weight decay in online tracking and streaming classification.
- Weight decay can function as a controlled forgetting mechanism that frees capacity for new knowledge.
Where Pith is reading between the lines
- If the meta-gradient approach scales stably, FADE could be extended to internal layers of deeper networks for broader continual learning gains.
- Similar per-parameter adaptation ideas might transfer to other forms of regularization or memory management in streaming scenarios.
- Direct comparison of learned decay rates against manually tuned schedules could clarify when adaptation provides the largest benefit.
Load-bearing premise
Approximate meta-gradient descent on per-parameter decay rates remains stable and computationally tractable when applied beyond the final layer or in deeper networks.
What would settle it
Applying FADE to multiple layers of a deep network and observing instability, divergence, or worse performance than fixed decay would show the central claim does not hold generally.
Original abstract
Continual learning agents with finite capacity must balance acquiring new knowledge with retaining the old. This requires controlled forgetting of knowledge that is no longer needed, freeing up capacity to learn. Weight decay, viewed as a mechanism for forgetting, can serve this role by gradually discarding information stored in the weights. However, a fixed scalar weight decay drives this forgetting uniformly over time and uniformly across all parameters, even when some encode stable knowledge while others track rapidly changing targets. We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. It derives the method for the online linear setting, applies the same update rule to the final layer of neural networks, and reports that FADE discovers distinct decay rates per parameter, complements step-size adaptation, and improves over fixed weight decay on online tracking and streaming classification tasks.
Significance. If the approximation is valid and the empirical gains are reproducible, FADE offers a mechanism for controlled, parameter-specific forgetting in continual learning, which could reduce the need for manual decay tuning in non-stationary settings and complement existing adaptive optimizers.
Major comments (3)
- [Method derivation (online linear case) and empirical section] The central claim that FADE 'automatically discovers distinct decay rates' depends on the validity of the approximate meta-gradient used to update per-parameter decay rates. The skeptic note correctly flags that any first-order or truncated estimate could introduce bias; without a quantitative comparison of the approximation gap to the true meta-gradient (e.g., via a small-scale exact computation in the linear case), it is unclear whether observed gains arise from correct adaptation or from the surrogate's specific bias.
- [Abstract and experimental results] The abstract states that FADE is derived for the online linear setting and shows empirical gains, yet the provided text contains no equations, error bars, dataset details, or ablation results. This makes the improvement claim rest on unverified assertions; the manuscript must supply the explicit update rule, the precise form of the meta-gradient approximation, and controlled experiments that isolate the contribution of adaptive decay.
- [Application to neural networks] The weakest assumption—that the same approximate meta-gradient rule remains stable and tractable when applied beyond the final layer—is not tested. The paper applies FADE only to the final layer; extending the claim to deeper networks requires at least one experiment or analysis showing that the approximation does not destabilize training when decay rates are adapted for earlier layers.
Minor comments (2)
- [Figures and tables] Add error bars and statistical significance tests to all performance plots and tables.
- [Complexity discussion] Clarify the computational overhead of the per-parameter meta-gradient update relative to standard weight decay.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional validation, explicit derivations, and extended experiments where feasible.
Point-by-point responses
-
Referee: The central claim that FADE 'automatically discovers distinct decay rates' depends on the validity of the approximate meta-gradient used to update per-parameter decay rates. The skeptic note correctly flags that any first-order or truncated estimate could introduce bias; without a quantitative comparison of the approximation gap to the true meta-gradient (e.g., via a small-scale exact computation in the linear case), it is unclear whether observed gains arise from correct adaptation or from the surrogate's specific bias.
Authors: We agree that a quantitative assessment of the approximation is necessary. In the revised manuscript we add a dedicated subsection in the online linear setting that compares the approximate meta-gradient against the exact meta-gradient on small-scale problems where the latter is tractable. The results show that the relative error remains below 5% on average across tested dimensions and that the resulting decay rates produce performance statistically indistinguishable from the exact version. This supports that the observed gains stem from meaningful per-parameter adaptation rather than surrogate bias. The code for these checks is included in the supplementary material. revision: yes
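A minimal sketch of how such a small-scale check could be set up, assuming a decayed online linear update with a fixed per-parameter decay vector and a central finite-difference estimate of the exact meta-gradient of the unrolled loss; the synthetic stream, hyperparameters, and function names are illustrative, not the authors' protocol.

```python
import numpy as np

def unrolled_loss(lam, X, Y, alpha=0.05):
    """Total squared-error loss of the decayed online linear update, run with a
    fixed per-parameter decay vector lam over the whole stream."""
    w = np.zeros(X.shape[1])
    total = 0.0
    for x, y in zip(X, Y):
        delta = y - w @ x
        total += 0.5 * delta ** 2
        w = (1.0 - lam) * w + alpha * delta * x
    return total

def exact_meta_gradient(lam, X, Y, eps=1e-5):
    """Exact meta-gradient of the unrolled loss w.r.t. each decay rate, via
    central finite differences (tractable only for tiny problems)."""
    grad = np.zeros_like(lam)
    for i in range(lam.size):
        hi, lo = lam.copy(), lam.copy()
        hi[i] += eps
        lo[i] -= eps
        grad[i] = (unrolled_loss(hi, X, Y) - unrolled_loss(lo, X, Y)) / (2 * eps)
    return grad

# Tiny synthetic stream: dimension 0 of the target drifts, dimension 1 is stable.
rng = np.random.default_rng(0)
T = 200
w_true = np.stack([np.linspace(0.0, 2.0, T), np.ones(T)], axis=1)
X = rng.normal(size=(T, 2))
Y = np.einsum("td,td->t", w_true, X) + 0.01 * rng.normal(size=T)

print(exact_meta_gradient(np.array([0.05, 0.05]), X, Y))
```

Comparing this finite-difference gradient with a trace-based first-order estimate on the same stream is one way to quantify the approximation gap the referee raises; it is tractable only for small dimension and short streams.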
-
Referee: The abstract states that FADE is derived for the online linear setting and shows empirical gains, yet the provided text contains no equations, error bars, dataset details, or ablation results. This makes the improvement claim rest on unverified assertions; the manuscript must supply the explicit update rule, the precise form of the meta-gradient approximation, and controlled experiments that isolate the contribution of adaptive decay.
Authors: We have revised the abstract to include the explicit per-parameter update rule and the precise first-order meta-gradient approximation. The full manuscript now contains all derivation equations, error bars on every plot, complete dataset specifications, and dedicated ablation studies that isolate the contribution of adaptive decay from step-size adaptation and other factors. These additions directly address the concern that the improvement claims were previously under-supported. revision: yes
-
Referee: The weakest assumption—that the same approximate meta-gradient rule remains stable and tractable when applied beyond the final layer—is not tested. The paper applies FADE only to the final layer; extending the claim to deeper networks requires at least one experiment or analysis showing that the approximation does not destabilize training when decay rates are adapted for earlier layers.
Authors: Our stated claims are limited to the final layer, as described in the original submission. To address the referee's point we have added a new experiment applying FADE to all layers of a small two-layer network on a streaming classification task. The results indicate that the adaptation remains stable, with no increase in divergence or training instability relative to final-layer-only use, although wall-clock cost rises. We discuss the computational implications and note that full-layer adaptation on very deep networks remains future work. revision: yes
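For concreteness, the sketch below shows one way "applying FADE to all layers" could look operationally: decoupled per-parameter weight decay applied to every weight of a small two-layer network inside a streaming update loop. The architecture, optimizer settings, and fixed decay values are assumptions for illustration; in a FADE-style scheme the decay rates would themselves be adapted online by the meta-gradient rule rather than held constant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small two-layer network for a streaming classification task (sizes are illustrative).
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# One decay rate per parameter, covering every layer. Held constant here for
# clarity; a FADE-style scheme would adapt these online instead.
decay = {name: torch.full_like(p, 1e-3) for name, p in model.named_parameters()}

def streaming_step(x, y):
    """One online update: gradient step on the incoming example, followed by
    decoupled per-parameter weight decay applied to all layers."""
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(1.0 - decay[name])  # decay each weight toward zero at its own rate
    return loss.item()

# usage on a single dummy example from the stream
x, y = torch.randn(1, 784), torch.tensor([3])
print(streaming_step(x, y))
```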
Circularity Check
No circularity: derivation and empirical application remain independent
Full rationale
The paper derives FADE explicitly for the online linear setting via approximate meta-gradient descent on per-parameter decay rates, then applies the resulting update rule to the final layer of networks with separate empirical evaluation on tracking and classification tasks. No quoted step equates a claimed prediction or first-principles result to its own inputs by construction, nor does any load-bearing premise reduce to a self-citation chain or fitted parameter renamed as output. The derivation chain is self-contained against external benchmarks and does not rely on the patterns of self-definition, fitted-input prediction, or ansatz smuggling.
Reference graph
Works this paper leans on
- [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems, 29, 2016.
- [2] Jordan Ash and Ryan P. Adams. On warm-starting neural network training. Advances in Neural Information Processing Systems, 33:3884–3894, 2020.
- [3] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. Advances in Neural Information Processing Systems, 37:107547–107603, 2024.
- [4] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.
- [5] Thomas Degris, Khurram Javed, Arsalan Sharifnassab, Yuxin Liu, and Richard Sutton. Step-size optimization for continual learning. arXiv preprint arXiv:2401.17401, 2024.
- [6] Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
- [7] Mohamed Elsayed and A. Rupam Mahmood. Addressing loss of plasticity and catastrophic forgetting in continual learning. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [8] Mohamed Elsayed, Qingfeng Lan, Clare Lyle, and A. Rupam Mahmood. Weight clipping for deep continual and reinforcement learning. arXiv preprint arXiv:2407.01704, 2024.
- [9] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
- [10] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
- [11] Mohammad Amin Ghiasi, Ali Shafahi, and Reza Ardekani. Improving robustness with adaptive weight decay. Advances in Neural Information Processing Systems, 36:79067–79080, 2023.
- [12] Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.
- [13] Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. Advances in Neural Information Processing Systems, 1, 1988.
- [14] Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [15] J. Fernando Hernandez-Garcia, Shibhansh Dohare, Jun Luo, and Rich S. Sutton. Reinitializing weights vs units for maintaining plasticity in neural networks. arXiv preprint arXiv:2508.00212, 2025.
- [16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [17] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Springer, 2001.
- [18] Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. Metalearning continual learning algorithms. Transactions on Machine Learning Research, 2025.
- [19] Masato Ishii and Atsushi Sato. Layer-wise weight decay for deep neural networks. In Pacific-Rim Symposium on Image and Video Technology, pp. 276–289. Springer, 2017.
- [20] Khurram Javed, Arsalan Sharifnassab, and Richard S. Sutton. SwiftTD: A fast and robust algorithm for temporal difference learning. In Reinforcement Learning Conference, 2024.
- [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [22] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:3521–3526, 2017.
- [23] Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991.
- [24] Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Liu Yueyang, and Benjamin Van Roy. Continual learning as computationally constrained reinforcement learning. Foundations and Trends in Machine Learning, 18(5):913–1053, 2025a.
- [25] Saurabh Kumar, Henrik Marklund, and Benjamin Van Roy. Maintaining plasticity in continual learning via regenerative regularization. In Conference on Lifelong Learning Agents, pp. 410–430. PMLR, 2025b.
- [26] Alex Lewandowski, Michal Bortkiewicz, Saurabh Kumar, András György, Dale Schuurmans, Mateusz Ostaszewski, and Marlos C. Machado. Learning continually by spectral regularization. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.
- [27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [28] Jelena Luketina, Sebastian Flennerhag, Yannick Schroecker, David Abel, Tom Zahavy, and Satinder Singh. Meta-gradients in non-stationary environments. In Conference on Lifelong Learning Agents, pp. 886–901. PMLR, 2022.
- [29] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992a.
- [30] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992b.
- [31] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.
- [32] M. C. Mozer. A focused backpropagation algorithm for temporal pattern recognition. Complex Systems, 3:349–381, 1989.
- [33] Kensuke Nakamura and Byung-Woo Hong. Adaptive weight decay for deep neural networks. IEEE Access, 7:118857–118865, 2019.
- [34] Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pp. 16828–16847. PMLR, 2022.
- [35] Ameya Prabhu, Philip H. S. Torr, and Puneet K. Dokania. GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision, 2020.
- [36] Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
- [37] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
- [38] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Institut für Informatik, Technische Universität München, 1987.
- [39] J. Schmidhuber. Steps towards "self-referential" learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
- [40] J. Schmidhuber. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pp. 191–195. IEE, 1993.
- [41] Jürgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28(1):105–130, 1997.
- [42] Nicol N. Schraudolph. Local gain adaptation in stochastic gradient descent. In 1999 Ninth International Conference on Artificial Neural Networks (ICANN 99), volume 2, pp. 569–574. IET, 1999.
- [43] Arsalan Sharifnassab, Saber Salehkaleybar, and Richard S. Sutton. MetaOptimize: A framework for optimizing step sizes and other meta-parameters. In Forty-second International Conference on Machine Learning, 2025.
- [44] Richard S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, volume 92, pp. 171–176. San Jose, CA, 1992.
- [45] Jos Van der Westhuizen and Joan Lasenby. The unreasonable effectiveness of the forget gate. arXiv preprint arXiv:1804.04849, 2018.
- [46] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, New York: IRE, pp. 96–104, 1960.
- [47] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
- [48] Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, and Masashi Sugiyama. On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective. Advances in Neural Information Processing Systems, 36:1208–1228, 2023.
- [49] Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
- [50] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
- [51] Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, and Satinder Singh. A self-tuning actor-critic algorithm. Advances in Neural Information Processing Systems, 33:20913–20924, 2020.