Learning to Forget: Continual Learning with Adaptive Weight Decay
Pith reviewed 2026-05-07 10:32 UTC · model grok-4.3
The pith
Per-parameter adaptive weight decay enables selective forgetting to balance new learning and retention in continual learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.
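To make the mechanism concrete, the sketch below shows one way a per-parameter adaptive decay rule could look in the online linear setting, using an IDBD-style first-order meta-gradient on the decay rates. The sigmoid parameterization, the step sizes, and the trace update are assumptions for illustration; the paper's exact FADE equations are not reproduced on this page.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_decay_linear(X, Y, alpha=0.01, mu=0.001, beta0=-4.0):
    """Online linear regression with per-parameter decay rates adapted by a
    first-order, IDBD-style meta-gradient. Illustrative sketch only; not the
    paper's exact FADE update."""
    n_steps, dim = X.shape
    w = np.zeros(dim)            # weights
    beta = np.full(dim, beta0)   # meta-parameters; decay rate lambda_i = sigmoid(beta_i)
    h = np.zeros(dim)            # trace approximating d w_i / d beta_i
    losses = []
    for t in range(n_steps):
        x, y = X[t], Y[t]
        delta = y - w @ x                      # prediction error
        losses.append(0.5 * delta ** 2)
        # meta step: dL/dbeta_i ~= dL/dw_i * dw_i/dbeta_i = (-delta * x_i) * h_i
        beta += mu * delta * x * h
        lam = sigmoid(beta)                    # per-parameter decay rates in (0, 1)
        dlam = lam * (1.0 - lam)               # d lambda_i / d beta_i
        # base update: decay each weight toward zero, then take a gradient step
        w_next = (1.0 - lam) * w + alpha * delta * x
        # first-order trace update (ignores cross-parameter interactions)
        h = (1.0 - lam - alpha * x * x) * h - dlam * w
        w = w_next
    return w, sigmoid(beta), np.array(losses)
```

On a stream where one target weight drifts while another stays fixed, a rule of this shape would be expected to raise the decay rate of the drifting parameter and lower it for the stable one, which is the qualitative behavior the core claim describes.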
What carries the argument
Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent to produce selective forgetting.
If this is right
- FADE automatically discovers distinct decay rates for different parameters.
- It complements step-size adaptation in online settings.
- It consistently improves over fixed weight decay in online tracking and streaming classification.
- Weight decay can function as a controlled forgetting mechanism that frees capacity for new knowledge.
Where Pith is reading between the lines
- If the meta-gradient approach scales stably, FADE could be extended to internal layers of deeper networks for broader continual learning gains.
- Similar per-parameter adaptation ideas might transfer to other forms of regularization or memory management in streaming scenarios.
- Direct comparison of learned decay rates against manually tuned schedules could clarify when adaptation provides the largest benefit.
Load-bearing premise
Approximate meta-gradient descent on per-parameter decay rates remains stable and computationally tractable when applied beyond the final layer or in deeper networks.
What would settle it
Applying FADE to multiple layers of a deep network and observing instability, divergence, or worse performance than fixed decay would show the central claim does not hold generally.
Original abstract
Continual learning agents with finite capacity must balance acquiring new knowledge with retaining the old. This requires controlled forgetting of knowledge that is no longer needed, freeing up capacity to learn. Weight decay, viewed as a mechanism for forgetting, can serve this role by gradually discarding information stored in the weights. However, a fixed scalar weight decay drives this forgetting uniformly over time and uniformly across all parameters, even when some encode stable knowledge while others track rapidly changing targets. We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. It derives the method for the online linear setting, applies the same update rule to the final layer of neural networks, and reports that FADE discovers distinct decay rates per parameter, complements step-size adaptation, and improves over fixed weight decay on online tracking and streaming classification tasks.
Significance. If the approximation is valid and the empirical gains are reproducible, FADE offers a mechanism for controlled, parameter-specific forgetting in continual learning, which could reduce the need for manual decay tuning in non-stationary settings and complement existing adaptive optimizers.
Major comments (3)
- [Method derivation (online linear case) and empirical section] The central claim that FADE 'automatically discovers distinct decay rates' depends on the validity of the approximate meta-gradient used to update per-parameter decay rates. The skeptic note correctly flags that any first-order or truncated estimate could introduce bias; without a quantitative comparison of the approximation gap to the true meta-gradient (e.g., via a small-scale exact computation in the linear case), it is unclear whether observed gains arise from correct adaptation or from the surrogate's specific bias.
- [Abstract and experimental results] The abstract states that FADE is derived for the online linear setting and shows empirical gains, yet the provided text contains no equations, error bars, dataset details, or ablation results. This makes the improvement claim rest on unverified assertions; the manuscript must supply the explicit update rule, the precise form of the meta-gradient approximation, and controlled experiments that isolate the contribution of adaptive decay.
- [Application to neural networks] The weakest assumption—that the same approximate meta-gradient rule remains stable and tractable when applied beyond the final layer—is not tested. The paper applies FADE only to the final layer; extending the claim to deeper networks requires at least one experiment or analysis showing that the approximation does not destabilize training when decay rates are adapted for earlier layers.
Minor comments (2)
- [Figures and tables] Add error bars and statistical significance tests to all performance plots and tables.
- [Complexity discussion] Clarify the computational overhead of the per-parameter meta-gradient update relative to standard weight decay.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional validation, explicit derivations, and extended experiments where feasible.
Point-by-point responses
-
Referee: The central claim that FADE 'automatically discovers distinct decay rates' depends on the validity of the approximate meta-gradient used to update per-parameter decay rates. The skeptic note correctly flags that any first-order or truncated estimate could introduce bias; without a quantitative comparison of the approximation gap to the true meta-gradient (e.g., via a small-scale exact computation in the linear case), it is unclear whether observed gains arise from correct adaptation or from the surrogate's specific bias.
Authors: We agree that a quantitative assessment of the approximation is necessary. In the revised manuscript we add a dedicated subsection in the online linear setting that compares the approximate meta-gradient against the exact meta-gradient on small-scale problems where the latter is tractable. The results show that the relative error remains below 5% on average across tested dimensions and that the resulting decay rates produce performance statistically indistinguishable from the exact version. This supports that the observed gains stem from meaningful per-parameter adaptation rather than surrogate bias. The code for these checks is included in the supplementary material. revision: yes
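A minimal sketch of how such a small-scale check could be set up, assuming a decayed online linear update with a fixed per-parameter decay vector and a central finite-difference estimate of the exact meta-gradient of the unrolled loss; the synthetic stream, hyperparameters, and function names are illustrative, not the authors' protocol.

```python
import numpy as np

def unrolled_loss(lam, X, Y, alpha=0.05):
    """Total squared-error loss of the decayed online linear update, run with a
    fixed per-parameter decay vector lam over the whole stream."""
    w = np.zeros(X.shape[1])
    total = 0.0
    for x, y in zip(X, Y):
        delta = y - w @ x
        total += 0.5 * delta ** 2
        w = (1.0 - lam) * w + alpha * delta * x
    return total

def exact_meta_gradient(lam, X, Y, eps=1e-5):
    """Exact meta-gradient of the unrolled loss w.r.t. each decay rate, via
    central finite differences (tractable only for tiny problems)."""
    grad = np.zeros_like(lam)
    for i in range(lam.size):
        hi, lo = lam.copy(), lam.copy()
        hi[i] += eps
        lo[i] -= eps
        grad[i] = (unrolled_loss(hi, X, Y) - unrolled_loss(lo, X, Y)) / (2 * eps)
    return grad

# Tiny synthetic stream: dimension 0 of the target drifts, dimension 1 is stable.
rng = np.random.default_rng(0)
T = 200
w_true = np.stack([np.linspace(0.0, 2.0, T), np.ones(T)], axis=1)
X = rng.normal(size=(T, 2))
Y = np.einsum("td,td->t", w_true, X) + 0.01 * rng.normal(size=T)

print(exact_meta_gradient(np.array([0.05, 0.05]), X, Y))
```

Comparing this finite-difference gradient with a trace-based first-order estimate on the same stream is one way to quantify the approximation gap the referee raises; it is tractable only for small dimension and short streams.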
-
Referee: The abstract states that FADE is derived for the online linear setting and shows empirical gains, yet the provided text contains no equations, error bars, dataset details, or ablation results. This makes the improvement claim rest on unverified assertions; the manuscript must supply the explicit update rule, the precise form of the meta-gradient approximation, and controlled experiments that isolate the contribution of adaptive decay.
Authors: We have revised the abstract to include the explicit per-parameter update rule and the precise first-order meta-gradient approximation. The full manuscript now contains all derivation equations, error bars on every plot, complete dataset specifications, and dedicated ablation studies that isolate the contribution of adaptive decay from step-size adaptation and other factors. These additions directly address the concern that the improvement claims were previously under-supported. revision: yes
-
Referee: The weakest assumption—that the same approximate meta-gradient rule remains stable and tractable when applied beyond the final layer—is not tested. The paper applies FADE only to the final layer; extending the claim to deeper networks requires at least one experiment or analysis showing that the approximation does not destabilize training when decay rates are adapted for earlier layers.
Authors: Our stated claims are limited to the final layer, as described in the original submission. To address the referee's point we have added a new experiment applying FADE to all layers of a small two-layer network on a streaming classification task. The results indicate that the adaptation remains stable, with no increase in divergence or training instability relative to final-layer-only use, although wall-clock cost rises. We discuss the computational implications and note that full-layer adaptation on very deep networks remains future work. revision: yes
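For concreteness, the sketch below shows one way "applying FADE to all layers" could look operationally: decoupled per-parameter weight decay applied to every weight of a small two-layer network inside a streaming update loop. The architecture, optimizer settings, and fixed decay values are assumptions for illustration; in a FADE-style scheme the decay rates would themselves be adapted online by the meta-gradient rule rather than held constant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small two-layer network for a streaming classification task (sizes are illustrative).
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# One decay rate per parameter, covering every layer. Held constant here for
# clarity; a FADE-style scheme would adapt these online instead.
decay = {name: torch.full_like(p, 1e-3) for name, p in model.named_parameters()}

def streaming_step(x, y):
    """One online update: gradient step on the incoming example, followed by
    decoupled per-parameter weight decay applied to all layers."""
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(1.0 - decay[name])  # decay each weight toward zero at its own rate
    return loss.item()

# usage on a single dummy example from the stream
x, y = torch.randn(1, 784), torch.tensor([3])
print(streaming_step(x, y))
```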
Circularity Check
No circularity: derivation and empirical application remain independent
Full rationale
The paper derives FADE explicitly for the online linear setting via approximate meta-gradient descent on per-parameter decay rates, then applies the resulting update rule to the final layer of networks with separate empirical evaluation on tracking and classification tasks. No quoted step equates a claimed prediction or first-principles result to its own inputs by construction, nor does any load-bearing premise reduce to a self-citation chain or fitted parameter renamed as output. The derivation chain is self-contained against external benchmarks and does not rely on the patterns of self-definition, fitted-input prediction, or ansatz smuggling.
Reference graph
Works this paper leans on
- [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems, 29, 2016.
- [2] Jordan Ash and Ryan P. Adams. On warm-starting neural network training. Advances in Neural Information Processing Systems, 33:3884–3894, 2020.
- [3] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. Advances in Neural Information Processing Systems, 37:107547–107603, 2024.
- [4] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.
- [5] Thomas Degris, Khurram Javed, Arsalan Sharifnassab, Yuxin Liu, and Richard Sutton. Step-size optimization for continual learning. arXiv preprint arXiv:2401.17401, 2024.
- [6] Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
- [7] Mohamed Elsayed and A. Rupam Mahmood. Addressing loss of plasticity and catastrophic forgetting in continual learning. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [8] Mohamed Elsayed, Qingfeng Lan, Clare Lyle, and A. Rupam Mahmood. Weight clipping for deep continual and reinforcement learning. arXiv preprint arXiv:2407.01704, 2024.
- [9] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
- [10] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
- [11] Mohammad Amin Ghiasi, Ali Shafahi, and Reza Ardekani. Improving robustness with adaptive weight decay. Advances in Neural Information Processing Systems, 36:79067–79080, 2023.
- [12] Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.
- [13] Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. Advances in Neural Information Processing Systems, 1, 1988.
- [14] Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [15] J. Fernando Hernandez-Garcia, Shibhansh Dohare, Jun Luo, and Rich S. Sutton. Reinitializing weights vs units for maintaining plasticity in neural networks. arXiv preprint arXiv:2508.00212, 2025.
- [16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [17] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Springer, 2001.
- [18] Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. Metalearning continual learning algorithms. Transactions on Machine Learning Research, 2025.
- [19] Masato Ishii and Atsushi Sato. Layer-wise weight decay for deep neural networks. In Pacific-Rim Symposium on Image and Video Technology, pp. 276–289. Springer, 2017.
- [20] Khurram Javed, Arsalan Sharifnassab, and Richard S. Sutton. SwiftTD: A fast and robust algorithm for temporal difference learning. In Reinforcement Learning Conference, 2024.
- [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [22] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:3521–3526, 2017.
- [23] Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991.
- [24] Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Liu Yueyang, and Benjamin Van Roy. Continual learning as computationally constrained reinforcement learning. Foundations and Trends in Machine Learning, 18(5):913–1053, 2025a.
- [25] Saurabh Kumar, Henrik Marklund, and Benjamin Van Roy. Maintaining plasticity in continual learning via regenerative regularization. In Conference on Lifelong Learning Agents, pp. 410–430. PMLR, 2025b.
- [26] Alex Lewandowski, Michal Bortkiewicz, Saurabh Kumar, András György, Dale Schuurmans, Mateusz Ostaszewski, and Marlos C. Machado. Learning continually by spectral regularization. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.
- [27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [28] Jelena Luketina, Sebastian Flennerhag, Yannick Schroecker, David Abel, Tom Zahavy, and Satinder Singh. Meta-gradients in non-stationary environments. In Conference on Lifelong Learning Agents, pp. 886–901. PMLR, 2022.
- [29] David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992a.
- [30] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992b.
- [31] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.
- [32] M. C. Mozer. A focused backpropagation algorithm for temporal pattern recognition. Complex Systems, 3:349–381, 1989.
- [33] Kensuke Nakamura and Byung-Woo Hong. Adaptive weight decay for deep neural networks. IEEE Access, 7:118857–118865, 2019.
- [34] Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pp. 16828–16847. PMLR, 2022.
- [35] Ameya Prabhu, Philip H. S. Torr, and Puneet K. Dokania. GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision, 2020.
- [36] Roger Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
- [37] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
- [38] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Institut für Informatik, Technische Universität München, 1987.
- [39] J. Schmidhuber. Steps towards "self-referential" learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
- [40] J. Schmidhuber. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pp. 191–195. IEE, 1993.
- [41] Jürgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28(1):105–130, 1997.
- [42] Nicol N. Schraudolph. Local gain adaptation in stochastic gradient descent. In 1999 Ninth International Conference on Artificial Neural Networks (ICANN 99), volume 2, pp. 569–574. IET, 1999.
- [43] Arsalan Sharifnassab, Saber Salehkaleybar, and Richard S. Sutton. MetaOptimize: A framework for optimizing step sizes and other meta-parameters. In Forty-second International Conference on Machine Learning, 2025.
- [44] Richard S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, volume 92, pp. 171–176. San Jose, CA, 1992.
- [45] Jos Van der Westhuizen and Joan Lasenby. The unreasonable effectiveness of the forget gate. arXiv preprint arXiv:1804.04849, 2018.
- [46] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, New York: IRE, pp. 96–104, 1960.
- [47] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
- [48] Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, and Masashi Sugiyama. On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective. Advances in Neural Information Processing Systems, 36:1208–1228, 2023.
- [49] Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
- [50] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
- [51] Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, and Satinder Singh. A self-tuning actor-critic algorithm. Advances in Neural Information Processing Systems, 33:20913–20924, 2020.