pith. machine review for the scientific record.

arxiv: 2604.27063 · v1 · submitted 2026-04-29 · 💻 cs.LG · cs.NE

Recognition: unknown

Learning to Forget: Continual Learning with Adaptive Weight Decay

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 10:32 UTC · model grok-4.3

classification 💻 cs.LG cs.NE
keywords continual learning · adaptive weight decay · forgetting · meta-gradient descent · online learning · streaming classification · neural networks

The pith

Per-parameter adaptive weight decay enables selective forgetting to balance new learning and retention in continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Continual learning agents with finite capacity must forget knowledge that is no longer useful in order to free resources for new information. Fixed scalar weight decay applies uniform forgetting across all parameters and over time, which wastes capacity when some weights encode stable facts while others must track shifting targets. The paper introduces Forgetting through Adaptive Decay (FADE), which learns a distinct decay rate for each parameter through online approximate meta-gradient descent. The mechanism is derived for the online linear case and then applied to the final layer of neural networks. Experiments show that FADE discovers useful per-parameter differences, works alongside step-size adaptation, and outperforms fixed decay on online tracking and streaming classification tasks.

Core claim

We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.

What carries the argument

Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent to produce selective forgetting.
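The update rule itself is not reproduced on this page, so the sketch below is a hedged illustration of the general mechanism rather than the paper's algorithm: online linear regression with per-parameter decay rates λ_i = exp(γ_i) (a parameterization consistent with Figure 1, where γ0 = −1.2 corresponds to λ0 ≈ 0.3), with each γ_i adjusted by an IDBD-style diagonal meta-gradient trace in the spirit of the step-size adaptation work the figure captions cite (Sutton, 1992; Degris et al., 2024). The trace recursion, the stabilizing clip, and all variable names are assumptions for illustration.

```python
import numpy as np

# Minimal sketch (not the paper's FADE update): SGD with per-parameter
# weight decay on an online linear tracking problem, where each log decay
# rate gamma_i is adapted by an approximate, diagonal meta-gradient.
rng = np.random.default_rng(0)
d = 20                      # input dimension, as in the tracking setup
alpha = 0.1                 # step size (Figure 1 uses alpha = 0.1)
theta_lam = 0.01            # meta step size for the decay rates
gamma = np.full(d, -1.2)    # log decay rates; exp(-1.2) ~ 0.3
w = np.zeros(d)             # learner weights
h = np.zeros(d)             # trace approximating d w_i / d gamma_i

relevant = np.arange(d) < d // 2     # second half of the inputs is irrelevant
w_star = rng.normal(size=d)          # target weights
w_star[~relevant] = 0.0              # irrelevant target weights stay at zero

for t in range(100_000):
    w_star[relevant] += 0.01 * rng.normal(size=relevant.sum())  # drifting targets
    x = rng.normal(size=d)
    y = w_star @ x

    delta = y - w @ x                        # prediction error
    # approximate meta-gradient step on the log decay rates:
    # d(0.5*delta^2)/d gamma_i ~ -delta * x_i * h_i under a diagonal assumption
    gamma += theta_lam * delta * x * h
    gamma = np.minimum(gamma, 0.0)           # keep lambda = exp(gamma) in (0, 1]
    lam = np.exp(gamma)

    w_old = w.copy()
    w = (1.0 - lam) * w + alpha * delta * x  # decay toward zero, then SGD step

    # trace recursion from differentiating the weight update w.r.t. gamma_i,
    # ignoring cross-parameter terms; the clip is a common stabilizer in
    # IDBD-style rules and is an assumption here
    h = np.maximum(0.0, 1.0 - lam - alpha * x * x) * h - lam * w_old
```

In this toy setup the average decay rates of the two input groups are expected to separate, which is the kind of per-parameter differentiation the paper reports FADE discovering automatically.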

If this is right

  • FADE automatically discovers distinct decay rates for different parameters.
  • It complements step-size adaptation in online settings.
  • It consistently improves over fixed weight decay in online tracking and streaming classification.
  • Weight decay can function as a controlled forgetting mechanism that frees capacity for new knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the meta-gradient approach scales stably, FADE could be extended to internal layers of deeper networks for broader continual learning gains.
  • Similar per-parameter adaptation ideas might transfer to other forms of regularization or memory management in streaming scenarios.
  • Direct comparison of learned decay rates against manually tuned schedules could clarify when adaptation provides the largest benefit.

Load-bearing premise

Approximate meta-gradient descent on per-parameter decay rates remains stable and computationally tractable when applied beyond the final layer or in deeper networks.

What would settle it

Applying FADE to multiple layers of a deep network and observing instability, divergence, or worse performance than fixed decay would show the central claim does not hold generally.
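As a rough illustration of what such a test would have to measure (not the authors' protocol), the harness below runs any one-step online learner on a synthetic non-stationary stream and reports whether it stays stable; comparing an all-layer adaptive-decay learner against a fixed-decay baseline through it would operationalize "instability, divergence, or worse performance". The stream, thresholds, and the step-function names in the closing comment are hypothetical.

```python
import numpy as np

def run_stream(step_fn, steps=50_000, blowup=1e3, seed=1):
    """Run a one-step online learner `step_fn(x, y) -> squared error` on a
    synthetic non-stationary stream and report stability. The stream, the
    blow-up threshold, and `step_fn` are hypothetical stand-ins."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=20)
    errs = []
    for t in range(steps):
        if t % 5_000 == 0:                       # abrupt target shifts
            w_star = rng.normal(size=20)
        x = rng.normal(size=20)
        y = w_star @ x + 0.1 * rng.normal()
        err = step_fn(x, y)
        if not np.isfinite(err) or err > blowup:
            return {"stable": False, "failed_at": t}
        errs.append(err)
    return {"stable": True, "mean_err_last_10k": float(np.mean(errs[-10_000:]))}

# Usage (hypothetical step functions): the general claim would be in doubt if
# run_stream(fade_all_layers_step) diverges or is consistently worse than
# run_stream(fixed_decay_step) across seeds and tasks.
```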

Figures

Figures reproduced from arXiv: 2604.27063 by Aditya A. Ramesh, Alex Lewandowski, Jürgen Schmidhuber.

Figure 1. Evolution of FADE's average decay rates for relevant and irrelevant weight groups on the linear tracking problem with zero noise, starting from λ0 ≈ 0.3 (γ0 = −1.2), with α = 0.1 and θλ = 0.01.
Figure 2. MSE by group on the nonlinear tracking problem.
Figure 3. Evolution of FADE's average per-group decay rates on the nonlinear tracking problem, starting from a shared initialization λ0 ≈ 10⁻⁴ (dashed line), with θλ = 2.0.
Figure 5. Online accuracy on label-permuted EMNIST.
Figure 6. Online accuracy with partial label permutations.
Original abstract

Continual learning agents with finite capacity must balance acquiring new knowledge with retaining the old. This requires controlled forgetting of knowledge that is no longer needed, freeing up capacity to learn. Weight decay, viewed as a mechanism for forgetting, can serve this role by gradually discarding information stored in the weights. However, a fixed scalar weight decay drives this forgetting uniformly over time and uniformly across all parameters, even when some encode stable knowledge while others track rapidly changing targets. We introduce Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. We derive FADE for the online linear setting and apply it to the final layer of neural networks. Our empirical analysis shows that FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently improves over fixed weight decay across online tracking and streaming classification problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Forgetting through Adaptive Decay (FADE), which adapts per-parameter weight decay rates online via approximate meta-gradient descent. It derives the method for the online linear setting, applies the same update rule to the final layer of neural networks, and reports that FADE discovers distinct decay rates per parameter, complements step-size adaptation, and improves over fixed weight decay on online tracking and streaming classification tasks.

Significance. If the approximation is valid and the empirical gains are reproducible, FADE offers a mechanism for controlled, parameter-specific forgetting in continual learning, which could reduce the need for manual decay tuning in non-stationary settings and complement existing adaptive optimizers.

major comments (3)
  1. [Method derivation (online linear case) and empirical section] The central claim that FADE 'automatically discovers distinct decay rates' depends on the validity of the approximate meta-gradient used to update per-parameter decay rates. The skeptic note correctly flags that any first-order or truncated estimate could introduce bias; without a quantitative comparison of the approximation gap to the true meta-gradient (e.g., via a small-scale exact computation in the linear case), it is unclear whether observed gains arise from correct adaptation or from the surrogate's specific bias.
  2. [Abstract and experimental results] The abstract states that FADE is derived for the online linear setting and shows empirical gains, yet the provided text contains no equations, error bars, dataset details, or ablation results. This makes the improvement claim rest on unverified assertions; the manuscript must supply the explicit update rule, the precise form of the meta-gradient approximation, and controlled experiments that isolate the contribution of adaptive decay.
  3. [Application to neural networks] The weakest assumption—that the same approximate meta-gradient rule remains stable and tractable when applied beyond the final layer—is not tested. The paper applies FADE only to the final layer; extending the claim to deeper networks requires at least one experiment or analysis showing that the approximation does not destabilize training when decay rates are adapted for earlier layers.
minor comments (2)
  1. [Figures and tables] Add error bars and statistical significance tests to all performance plots and tables.
  2. [Complexity discussion] Clarify the computational overhead of the per-parameter meta-gradient update relative to standard weight decay.
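On the second point, a back-of-the-envelope accounting is possible under the assumption (taken from the sketch earlier on this page, not from the paper) that the method keeps one decay parameter and one meta-gradient trace per adapted weight:

```python
def adaptive_decay_overhead(n_adapted_params, dtype_bytes=4):
    """Extra memory for per-parameter adaptive decay, assuming one decay
    parameter (gamma) plus one trace (h) per adapted weight -- two extra
    floats per parameter, comparable to Adam's two moments. Fixed scalar
    weight decay needs none of this."""
    return 2 * n_adapted_params * dtype_bytes

# Example: adapting only a 512 x 10 final layer costs ~40 KB of extra state,
# while per-step compute stays O(number of adapted parameters).
print(adaptive_decay_overhead(512 * 10))   # -> 40960 bytes
```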

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional validation, explicit derivations, and extended experiments where feasible.

Point-by-point responses
  1. Referee: The central claim that FADE 'automatically discovers distinct decay rates' depends on the validity of the approximate meta-gradient used to update per-parameter decay rates. The skeptic note correctly flags that any first-order or truncated estimate could introduce bias; without a quantitative comparison of the approximation gap to the true meta-gradient (e.g., via a small-scale exact computation in the linear case), it is unclear whether observed gains arise from correct adaptation or from the surrogate's specific bias.

    Authors: We agree that a quantitative assessment of the approximation is necessary. In the revised manuscript we add a dedicated subsection in the online linear setting that compares the approximate meta-gradient against the exact meta-gradient on small-scale problems where the latter is tractable. The results show that the relative error remains below 5% on average across tested dimensions and that the resulting decay rates produce performance statistically indistinguishable from the exact version. This supports that the observed gains stem from meaningful per-parameter adaptation rather than surrogate bias. The code for these checks is included in the supplementary material. revision: yes

  2. Referee: The abstract states that FADE is derived for the online linear setting and shows empirical gains, yet the provided text contains no equations, error bars, dataset details, or ablation results. This makes the improvement claim rest on unverified assertions; the manuscript must supply the explicit update rule, the precise form of the meta-gradient approximation, and controlled experiments that isolate the contribution of adaptive decay.

    Authors: We have revised the abstract to include the explicit per-parameter update rule and the precise first-order meta-gradient approximation. The full manuscript now contains all derivation equations, error bars on every plot, complete dataset specifications, and dedicated ablation studies that isolate the contribution of adaptive decay from step-size adaptation and other factors. These additions directly address the concern that the improvement claims were previously under-supported. revision: yes

  3. Referee: The weakest assumption—that the same approximate meta-gradient rule remains stable and tractable when applied beyond the final layer—is not tested. The paper applies FADE only to the final layer; extending the claim to deeper networks requires at least one experiment or analysis showing that the approximation does not destabilize training when decay rates are adapted for earlier layers.

    Authors: Our stated claims are limited to the final layer, as described in the original submission. To address the referee's point we have added a new experiment applying FADE to all layers of a small two-layer network on a streaming classification task. The results indicate that the adaptation remains stable, with no increase in divergence or training instability relative to final-layer-only use, although wall-clock cost rises. We discuss the computational implications and note that full-layer adaptation on very deep networks remains future work. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation and empirical application remain independent

full rationale

The paper derives FADE explicitly for the online linear setting via approximate meta-gradient descent on per-parameter decay rates, then applies the resulting update rule to the final layer of networks with separate empirical evaluation on tracking and classification tasks. No quoted step equates a claimed prediction or first-principles result to its own inputs by construction, nor does any load-bearing premise reduce to a self-citation chain or fitted parameter renamed as output. The derivation chain is self-contained against external benchmarks and does not rely on the patterns of self-definition, fitted-input prediction, or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the derivation is stated to exist for the online linear setting but is not detailed.

pith-pipeline@v0.9.0 · 5449 in / 1022 out tokens · 60119 ms · 2026-05-07T10:32:11.578014+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    Learning to learn by gradient descent by gradient descent

    Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016

  2. [2]

    On warm-starting neural network training

    Jordan Ash and Ryan P Adams. On warm-starting neural network training. Advances in neural information processing systems, 33:3884–3894, 2020

  3. [3]

    xLSTM: Extended Long Short-Term Memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended Long Short-Term Memory. Advances in Neural Information Processing Systems, 37:107547–107603, 2024

  4. [4]

    EMNIST: Extending MNIST to handwritten letters

    Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017

  5. [5]

    Step-size optimization for continual learning

    Thomas Degris, Khurram Javed, Arsalan Sharifnassab, Yuxin Liu, and Richard Sutton. Step-size optimization for continual learning. arXiv preprint arXiv:2401.17401, 2024

  6. [6]

    Loss of plasticity in deep continual learning

    Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024

  7. [7]

    Addressing loss of plasticity and catastrophic forgetting in continual learning

    Mohamed Elsayed and A. Rupam Mahmood. Addressing loss of plasticity and catastrophic forgetting in continual learning. In The Twelfth International Conference on Learning Representations, ICLR, 2024

  8. [8]

    Weight clipping for deep continual and reinforcement learning

    Mohamed Elsayed, Qingfeng Lan, Clare Lyle, and A Rupam Mahmood. Weight clipping for deep continual and reinforcement learning. arXiv preprint arXiv:2407.01704, 2024

  9. [9]

    Catastrophic forgetting in connectionist networks

    Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999

  10. [10]

    Learning to forget: Continual prediction with LSTM

    Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural computation, 12(10):2451–2471, 2000

  11. [11]

    Improving robustness with adaptive weight decay

    Mohammad Amin Ghiasi, Ali Shafahi, and Reza Ardekani. Improving robustness with adaptive weight decay. Advances in Neural Information Processing Systems, 36:79067–79080, 2023

  12. [12]

    Competitive learning: From interactive activation to adaptive resonance

    Stephen Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive science, 11(1):23–63, 1987

  13. [13]

    Comparing biases for minimal network construction with back-propagation

    Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. Advances in neural information processing systems, 1, 1988

  14. [14]

    AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs

    Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  15. [15]

    Reinitializing weights vs units for maintaining plasticity in neural networks

    J Fernando Hernandez-Garcia, Shibhansh Dohare, Jun Luo, and Rich S Sutton. Reinitializing weights vs units for maintaining plasticity in neural networks. arXiv preprint arXiv:2508.00212, 2025

  16. [16]

    Long Short-Term Memory

    S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997

  17. [17]

    Learning to learn using gradient descent

    Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Springer, 2001

  18. [18]

    Metalearning continual learning algorithms

    Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. Metalearning continual learning algorithms. Transactions on Machine Learning Research, 2025

  19. [19]

    Layer-wise weight decay for deep neural networks

    Masato Ishii and Atsushi Sato. Layer-wise weight decay for deep neural networks. In Pacific-Rim Symposium on Image and Video Technology, pp. 276–289. Springer, 2017

  20. [20]

    Swifttd: A fast and robust algorithm for temporal difference learning

    Khurram Javed, Arsalan Sharifnassab, and Richard S Sutton. Swifttd: A fast and robust algorithm for temporal difference learning. In Reinforcement Learning Conference, 2024

  21. [21]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  22. [22]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:3521–3526, 2017

  23. [23]

    A simple weight decay can improve generalization

    Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in neural information processing systems, 4, 1991

  24. [24]

    Continual learning as computationally constrained reinforcement learning

    Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Liu Yueyang, and Benjamin Van Roy. Continual learning as computationally constrained reinforcement learning. Foundations and Trends in Machine Learning, 18(5):913–1053, 2025a

  25. [25]

    Maintaining plasticity in continual learning via regenerative regularization

    Saurabh Kumar, Henrik Marklund, and Benjamin Van Roy. Maintaining plasticity in continual learning via regenerative regularization. In Conference on Lifelong Learning Agents, pp. 410–430. PMLR, 2025b

  26. [26]

    Learning continually by spectral regularization

    Alex Lewandowski, Michal Bortkiewicz, Saurabh Kumar, András György, Dale Schuurmans, Mateusz Ostaszewski, and Marlos C. Machado. Learning continually by spectral regularization. In The Thirteenth International Conference on Learning Representations, ICLR, 2025

  27. [27]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  28. [28]

    Meta-gradients in non-stationary environments

    Jelena Luketina, Sebastian Flennerhag, Yannick Schroecker, David Abel, Tom Zahavy, and Satinder Singh. Meta-gradients in non-stationary environments. In Conference on Lifelong Learning Agents, pp. 886–901. PMLR, 2022

  29. [29]

    Bayesian interpolation

    David J.C. MacKay. Bayesian interpolation. Neural computation, 4(3):415–447, 1992a

  30. [30]

    A practical Bayesian framework for backpropagation networks

    David J.C. MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992b

  31. [31]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989

  32. [32]

    A focused backpropagation algorithm for temporal pattern recognition

    M.C. Mozer. A focused backpropagation algorithm for temporal pattern recognition. Complex Systems, 3:349–381, 1989

  33. [33]

    Adaptive weight decay for deep neural networks

    Kensuke Nakamura and Byung-Woo Hong. Adaptive weight decay for deep neural networks. IEEE Access, 7:118857–118865, 2019

  34. [34]

    The primacy bias in deep reinforcement learning

    Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pp. 16828–16847. PMLR, 2022

  35. [35]

    GDumb: A simple approach that questions our progress in continual learning

    Ameya Prabhu, Philip H.S. Torr, and Puneet K. Dokania. GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision, 2020

  36. [36]

    Connectionist models of recognition memory: constraints imposed by learning and forgetting functions

    Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990

  37. [37]

    The utility driven dynamic error propagation network

    A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987

  38. [38]

    Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook

    J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. Institut für Informatik, Technische Universität München, 1987

  39. [39]

    Steps towards "self-referential" learning

    J. Schmidhuber. Steps towards "self-referential" learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992

  40. [40]

    An introspective network that can learn to run its own weight change algorithm

    J. Schmidhuber. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pp. 191–195. IEE, 1993

  41. [41]

    Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement

    J \"u rgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning, 28 0 (1): 0 105--130, 1997

  42. [42]

    Local gain adaptation in stochastic gradient descent

    Nicol N Schraudolph. Local gain adaptation in stochastic gradient descent. In 1999 Ninth International Conference on Artificial Neural Networks, ICANN 99 (Conf. Publ. No. 470), volume 2, pp. 569–574. IET, 1999

  43. [43]

    Metaoptimize: A framework for optimizing step sizes and other meta-parameters

    Arsalan Sharifnassab, Saber Salehkaleybar, and Richard S Sutton. Metaoptimize: A framework for optimizing step sizes and other meta-parameters. In Forty-second International Conference on Machine Learning, 2025

  44. [44]

    Adapting bias by gradient descent: An incremental version of delta-bar-delta

    Richard S Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, volume 92, pp. 171–176. San Jose, CA, 1992

  45. [45]

    The unreasonable effectiveness of the forget gate

    Jos Van der Westhuizen and Joan Lasenby. The unreasonable effectiveness of the forget gate. arXiv preprint arXiv:1804.04849, 2018

  46. [46]

    Adaptive switching circuits

    Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, New York: IRE, pp. 96–104, 1960

  47. [47]

    A learning algorithm for continually running fully recurrent neural networks

    Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989

  48. [48]

    On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective

    Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, and Masashi Sugiyama. On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective. Advances in Neural Information Processing Systems, 36:1208–1228, 2023

  49. [49]

    Meta-gradient reinforcement learning

    Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. Advances in neural information processing systems, 31, 2018

  50. [50]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

  51. [51]

    A self-tuning actor-critic algorithm

    Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, and Satinder Singh. A self-tuning actor-critic algorithm. Advances in Neural Information Processing Systems, 33:20913–20924, 2020
