Activation Function Design Sustains Plasticity in Continual Learning

Lute Lillo; Nick Cheney

arXiv: 2509.22562 · v4 · submitted 2025-09-26 · cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-18 12:56 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CV

keywords: activation functions · continual learning · plasticity loss · class-incremental learning · reinforcement learning · non-stationary environments · Smooth-Leaky

The pith

Activation function choice mitigates loss of plasticity in continual learning without added capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In continual learning, models trained on shifting data often lose the ability to adapt to new tasks or environments, a failure mode separate from catastrophic forgetting. The paper establishes that activation functions are a primary, architecture-agnostic factor in this loss and demonstrates that targeted design of their negative branch and saturation properties can sustain adaptability. The authors derive two drop-in replacements, Smooth-Leaky and Randomized Smooth-Leaky, and test them on class-incremental supervised benchmarks plus reinforcement learning in non-stationary MuJoCo domains with controlled distribution and dynamics shifts. A simple stress protocol is introduced to connect activation shape directly to adaptation performance under change. If the central claim holds, this supplies a lightweight, domain-general intervention that avoids the need for extra parameters or task-specific tuning.

Core claim

We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: supervised class-incremental benchmarks and reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change.

What carries the argument

Property-level analysis of negative-branch shape and saturation behavior, which determines adaptation under distribution and dynamics shifts.
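
As a rough illustration of what "negative-branch shape" and "saturation" mean for a single unit, the sketch below evaluates the derivatives of a few standard activations at increasingly negative pre-activations. The formulas are the textbook definitions; the probe points and the Leaky-ReLU slope of 0.01 are arbitrary choices made for this illustration, not values taken from the paper.

```python
import math

# Derivatives of standard activations (textbook definitions).
def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_leaky_relu(x, slope=0.01):  # slope is an illustrative choice
    return 1.0 if x > 0 else slope

def d_elu(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

DERIVATIVES = {
    "ReLU": d_relu,
    "Leaky-ReLU": d_leaky_relu,
    "ELU": d_elu,
    "Tanh": d_tanh,
}

def negative_branch_profile(points=(-1.0, -5.0, -20.0)):
    """Return {activation name: [derivative at each probe point]}."""
    return {name: [d(x) for x in points] for name, d in DERIVATIVES.items()}

if __name__ == "__main__":
    for name, grads in negative_branch_profile().items():
        print(name, ["%.2e" % g for g in grads])
```

ReLU passes exactly zero gradient for any negative input, Leaky-ReLU keeps a constant non-zero slope, and ELU and Tanh decay toward zero as the input moves further into the negative range. How these behaviors map onto the paper's own groupings is for the paper to define; the sketch only shows the raw derivatives.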

If this is right

Models retain greater ability to learn new classes sequentially without extra capacity when using the proposed activations.
Plasticity persists longer under controlled distribution and dynamics shifts in reinforcement learning tasks.
A lightweight diagnostic protocol can identify which activation shapes support adaptation before full training.
The benefit appears across different model architectures, reducing the need for task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same negative-branch principles might improve robustness in other non-stationary regimes such as online or lifelong learning.
Combining these activations with existing regularization or replay methods could compound gains in plasticity preservation.
Parameterizing the negative-branch shape more flexibly could yield further activation variants tuned to specific shift types.
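
One way to picture the last point is a rectifier whose negative branch is controlled by explicit parameters. The sketch below is a hypothetical example and is not the paper's Smooth-Leaky definition, which is not reproduced on this page. The function name, the `alpha` and `width` parameters, the tanh-based form, and the default bounds are all assumptions chosen so that the function is continuously differentiable at zero and its slope never drops below `alpha`. The randomized helper borrows the RReLU convention of sampling the slope during training and using the midpoint of the bounds at evaluation; whether Randomized Smooth-Leaky follows the same convention is also an assumption.

```python
import math
import random

def smooth_floor_rectifier(x, alpha=0.1, width=5.0):
    """Hypothetical C^1 rectifier (NOT the paper's Smooth-Leaky).

    Identity for x >= 0. For x < 0 the slope decays smoothly from 1
    at the origin toward the floor `alpha` for very negative inputs.
    """
    if x >= 0:
        return x
    return alpha * x + (1.0 - alpha) * width * math.tanh(x / width)

def smooth_floor_rectifier_grad(x, alpha=0.1, width=5.0):
    """Analytic derivative of smooth_floor_rectifier."""
    if x >= 0:
        return 1.0
    return alpha + (1.0 - alpha) * (1.0 - math.tanh(x / width) ** 2)

class RandomizedSlope:
    """Sample the floor `alpha` from [low, high] (RReLU-style convention)."""

    def __init__(self, low=0.05, high=0.15, seed=0):
        self.low, self.high = low, high
        self.rng = random.Random(seed)

    def sample(self, training=True):
        if training:
            return self.rng.uniform(self.low, self.high)
        # Evaluation mode: use the midpoint of the bounds.
        return 0.5 * (self.low + self.high)

if __name__ == "__main__":
    slope = RandomizedSlope()
    a = slope.sample(training=True)
    for x in (-20.0, -5.0, -1.0, 0.0, 1.0):
        print(x, smooth_floor_rectifier(x, alpha=a),
              smooth_floor_rectifier_grad(x, alpha=a))
```

Varying `alpha` changes the derivative floor while `width` changes how quickly the slope approaches it, so the two properties can be swept independently in an ablation.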

Load-bearing premise

The shape of the negative branch and the saturation behavior of an activation function directly determine how well a model adapts when data distributions or environment dynamics change.

What would settle it

If the new activations produce no measurable improvement in adaptation speed or final performance relative to ReLU across the class-incremental and non-stationary MuJoCo benchmarks under identical training conditions, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2509.22562 by Lute Lillo, Nick Cheney.

**Figure 1.** Desaturation under scaling shocks γ. Left: mean AUSC (lower is better). Middle: SF recovery time (epochs to halve the saturated fraction after the shock; successful recoveries only). Right: SF non-recovery rate (%). Groups: Zero-floor = ReLU, Tanh, Sigmoid; Non-zero-floor = Leaky-ReLU, RReLU, PReLU; Effective non-zero-floor = ELU, CELU, SELU, GELU, Swish. See App. D.2 for details.

**Figure 2.** Sidedness effects under shocks. Left: Peak saturated fraction during the shock (higher = more units saturated). Middle: Saturation Fraction (SF) time-to-half-recover (epochs; successful recoveries only; lower is better). Right: AUSC (lower is better). Groups: One-sided (kink) = Leaky-ReLU, PReLU, RReLU; One-sided (smooth) = ELU, CELU, SELU; Two-sided (saturating) = Sigmoid, Tanh. See App. D.3 for details.

**Figure 3.** Smooth-Leaky with α=0.1, p=3.0, c=5.0. Randomized Smooth-Leaky draws α from bounds; visually it matches Smooth-Leaky for the sampled α. Guided by the criteria of Sec. 5, namely (i) a strict non-zero floor, (ii) a moderate leak, and (iii) a preference for C¹ over C⁰ when (i)–(ii) are held fixed, we introduce two drop-in rectifiers that keep capacity unchanged. The Smooth-Leaky activation function (…

**Figure 4.** Plasticity Score across 5 seeds (95% bootstrap CIs) showing a complete sequence of 3 cycles across …
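
The recovery metrics named in the captions of Figures 1 and 2 can be sketched in a few lines. The code assumes a per-epoch saturated-fraction (SF) series is already available as a list of floats, measures the reference level at the shock epoch itself, and reads AUSC as a trapezoidal area under the SF curve. The captions do not expand that acronym or define how SF is computed, so those readings, along with all function names, are assumptions.

```python
def time_to_half_recover(sf, shock_epoch):
    """Epochs after the shock until SF falls to half its shock-time value.

    Returns None when the series never recovers (a "non-recovery").
    """
    level = sf[shock_epoch]
    if level <= 0:
        return 0  # nothing was saturated, so there is nothing to recover
    for t in range(shock_epoch + 1, len(sf)):
        if sf[t] <= 0.5 * level:
            return t - shock_epoch
    return None

def area_under_curve(sf, start, stop):
    """Trapezoidal area under SF over epochs start..stop (unit spacing)."""
    segment = sf[start:stop + 1]
    return sum(0.5 * (a + b) for a, b in zip(segment, segment[1:]))

def non_recovery_rate(runs, shock_epoch):
    """Percentage of runs whose SF never halves after the shock."""
    misses = sum(time_to_half_recover(sf, shock_epoch) is None for sf in runs)
    return 100.0 * misses / len(runs)

if __name__ == "__main__":
    # Placeholder series, not data from the paper.
    recovering = [0.1, 0.8, 0.6, 0.3, 0.2]
    stuck = [0.1, 0.8, 0.8, 0.7, 0.7]
    print(time_to_half_recover(recovering, shock_epoch=1))
    print(time_to_half_recover(stuck, shock_epoch=1))
    print(non_recovery_rate([recovering, stuck], shock_epoch=1))
```

Averaging recovery time over successful recoveries only, as the captions specify, means the non-recovery rate has to be read alongside it: an activation can show a short mean recovery time while failing to recover in many runs.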

Original abstract

In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Activation functions can help sustain plasticity in continual learning, but the tests do not isolate negative-branch shape as the main cause.

Activation choice can help keep models adaptable when data shifts over time, and this paper introduces two new functions to do that while testing them in class-incremental learning and non-stationary RL environments. The authors analyze negative-branch shape and saturation, then build Smooth-Leaky and Randomized Smooth-Leaky as drop-in replacements for standard activations. The evaluations cover supervised benchmarks with incremental classes and MuJoCo tasks with controlled distribution and dynamics changes, plus a stress protocol and diagnostics that tie activation shape to adaptation under change.

This stands out as a low-cost idea that avoids adding parameters or task-specific modules, which is practical for continual learning work. The cross-domain tests give it some breadth that many single-setting papers lack.

The main weakness is that the new activations change smoothness, exact form, and randomization at the same time as the negative branch. Without ablations that fix everything else and vary only the negative-branch properties, the results cannot cleanly show that shape and saturation are the causal drivers rather than side effects of the other differences. The abstract also skips quantitative numbers, error bars, and exclusion details, so the size of any gains stays unclear from the summary alone.

This paper is aimed at continual learning researchers who want simple, architecture-agnostic tweaks for non-stationary problems. Someone working on adaptive systems or plasticity diagnostics would get the most out of the stress protocol and the two-domain setup. It deserves peer review because the core problem is real and the approach is lightweight with decent coverage, even if revisions should tighten the causal claims and add clearer controls.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that activation function choice serves as a primary, architecture-agnostic lever for mitigating loss of plasticity in continual learning. Through a property-level analysis of negative-branch shape and saturation behavior, the authors propose Smooth-Leaky and Randomized Smooth-Leaky activations. These are evaluated in supervised class-incremental learning benchmarks and reinforcement learning tasks with non-stationary MuJoCo environments that induce distribution and dynamics shifts. A stress protocol and diagnostics are provided to link activation shape to adaptation under change.

Significance. If substantiated, the finding that thoughtful activation design can sustain plasticity without extra capacity or task-specific tuning would be of high significance for continual learning research. It provides a lightweight, domain-general approach applicable to both supervised and RL settings. The introduction of a stress protocol is a positive contribution for future diagnostics.

major comments (3)

Abstract: The assertion that activation choice is a 'primary' lever lacks supporting comparisons to established methods for mitigating plasticity loss, such as regularization techniques or experience replay, making it difficult to gauge its relative importance.
Section 3 (Property-level analysis): The analysis of negative-branch shape and saturation does not isolate these properties as the causal factors. The proposed Smooth-Leaky and Randomized Smooth-Leaky activations differ from ReLU and LeakyReLU in smoothness, randomization, and functional form simultaneously. Without an ablation that varies only the negative-branch shape while holding other properties constant, the experiments cannot establish the mechanism as load-bearing rather than a correlated side effect.
Section 4 (Experiments): The evaluations in class-incremental benchmarks and non-stationary MuJoCo environments are described, but the manuscript should include quantitative results with error bars, statistical significance tests, and exclusion criteria to allow verification of the reported improvements.

minor comments (2)

Notation: Ensure consistent definition of the new activation functions, perhaps with explicit equations for Smooth-Leaky and Randomized Smooth-Leaky.
Figures: Clarify the visualization of activation shapes and how they relate to the diagnostics in the stress protocol.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we intend to implement.

Referee: Abstract: The assertion that activation choice is a 'primary' lever lacks supporting comparisons to established methods for mitigating plasticity loss, such as regularization techniques or experience replay, making it difficult to gauge its relative importance.

Authors: We agree that the claim of activation choice serving as a 'primary' lever would be strengthened by explicit comparisons to other established techniques. While the manuscript emphasizes the lightweight and architecture-agnostic advantages of this approach, we will add direct comparisons against regularization methods (e.g., EWC) and experience replay in the revised experimental sections for both the supervised and RL settings to better situate the relative contribution.

Revision: yes

Referee: Section 3 (Property-level analysis): The analysis of negative-branch shape and saturation does not isolate these properties as the causal factors. The proposed Smooth-Leaky and Randomized Smooth-Leaky activations differ from ReLU and LeakyReLU in smoothness, randomization, and functional form simultaneously. Without an ablation that varies only the negative-branch shape while holding other properties constant, the experiments cannot establish the mechanism as load-bearing rather than a correlated side effect.

Authors: We appreciate this observation on the need for tighter causal isolation. The property-level analysis was intended to motivate design choices targeting negative-branch behavior and saturation to support adaptation under non-stationarity. To address the concern that multiple factors change at once, we will incorporate additional ablation experiments in the revised manuscript that hold smoothness and randomization fixed while systematically varying only the negative-branch shape, thereby clarifying whether this property is the primary driver.

Revision: yes

Referee: Section 4 (Experiments): The evaluations in class-incremental benchmarks and non-stationary MuJoCo environments are described, but the manuscript should include quantitative results with error bars, statistical significance tests, and exclusion criteria to allow verification of the reported improvements.

Authors: We concur that enhanced statistical reporting will improve verifiability. In the revised manuscript we will report means and standard deviations (error bars) across multiple independent runs, include appropriate statistical significance tests (such as paired t-tests with p-values), and explicitly document any exclusion criteria applied to the presented results.

Revision: yes
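
The reporting promised in the third response can be sketched with the Python standard library alone. The example computes a percentile bootstrap confidence interval for a mean across seeds (the interval type named in the Figure 4 caption) and the paired t statistic for two activations evaluated on the same seeds. The scores are placeholders rather than results from the paper, the function names are hypothetical, and the resample count is an arbitrary choice. Converting the t statistic into a p-value requires the Student t distribution, which the standard library does not provide, so that step is left to a statistics package.

```python
import math
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo_idx = round(tail * n_resamples)
    hi_idx = round((1.0 - tail) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

def paired_t_statistic(a, b):
    """t statistic for paired samples (same seeds, two conditions).

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

if __name__ == "__main__":
    # Placeholder per-seed scores for two activations (5 seeds each).
    candidate = [0.71, 0.68, 0.74, 0.70, 0.69]
    baseline = [0.63, 0.66, 0.65, 0.61, 0.64]
    print("mean:", statistics.fmean(candidate))
    print("std:", statistics.stdev(candidate))
    print("95% CI:", bootstrap_ci(candidate))
    print("paired t (df=4):", paired_t_statistic(candidate, baseline))
```

Pairing by seed matters here: with only a handful of seeds, an unpaired comparison discards the shared initialization and data order that make the two runs comparable.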

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark evaluations without self-referential derivations or fitted predictions

full rationale

The manuscript introduces Smooth-Leaky and Randomized Smooth-Leaky activations after a property-level analysis of negative-branch shape and saturation, then reports results on class-incremental and non-stationary MuJoCo benchmarks. No equations, uniqueness theorems, or parameter-fitting steps are described that reduce by construction to author-defined inputs or prior self-citations. The central claims are supported by direct experimental comparison rather than any derivation chain that loops back to its own premises, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that activation shape properties control plasticity loss and on two newly introduced activation functions whose independent evidence is limited to the paper's own experiments.

axioms (1)

Domain assumption: Negative-branch shape and saturation behavior of activations influence adaptation under change in continual learning.
This premise underpins the property-level analysis and the design of the new nonlinearities.

invented entities (2)

Smooth-Leaky activation (no independent evidence)
Purpose: Drop-in nonlinearity that sustains plasticity through improved negative-branch shape.
Newly proposed in the paper; no external validation cited in abstract.

Randomized Smooth-Leaky activation (no independent evidence)
Purpose: Randomized variant of Smooth-Leaky for additional robustness in non-stationary settings.
Newly proposed in the paper; no external validation cited in abstract.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
cs.LG · 2026-04 · unverdicted · novelty 7.0

TeLAPA maintains archives of behaviorally diverse yet competent policies aligned in a shared latent space to preserve plasticity and enable faster recovery after interference in continual reinforcement learning.

On the Stability of Growth in Structural Plasticity
cs.LG · 2026-05 · unverdicted · novelty 5.0

Newborn units in growing neural networks are forward-active but backward-starved, receiving weaker gradients than existing units and creating integration challenges that make growth less reliable than pruning in compl...


[1] [1]

Continuously Differentiable Exponential Linear Units

URL https://proceedings.mlr.press/v202/ball23a.html. Jonathan T Barron. Continuously differentiable exponential linear units.arXiv preprint arXiv:1704.07483,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Smith, Razvan Pascanu, and Claudia Clopath

Tudor Berariu, Wojciech Czarnecki, Soham De, Jorg Bornschein, Samuel Smith, Razvan Pascanu, and Claudia Clopath. A study on the plasticity of neural networks.arXiv preprint arXiv:2106.00042,

work page arXiv

[3] [3]

Evolutionary optimization of deep learning activation functions

Garrett Bingham, William Macke, and Risto Miikkulainen. Evolutionary optimization of deep learning activation functions. InProceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 289–296,

work page 2020

[4] [4]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus).arXiv preprint arXiv:1511.07289, 4(5):11,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Recurrent rational networks.arXiv preprint arXiv:2102.09407, 2021a

Quentin Delfosse, Patrick Schramowski, Alejandro Molina, and Kristian Kersting. Recurrent rational networks.arXiv preprint arXiv:2102.09407, 2021a. Quentin Delfosse, Patrick Schramowski, Martin Mundt, Alejandro Molina, and Kristian Kersting. Adaptive rational activations to boost deep reinforcement learning.arXiv preprint arXiv:2102.09407, 2021b. Shibhans...

work page arXiv

[6] [6]

Addressing loss of plasticity and catastrophic forgetting in continual learning.arXiv preprint arXiv:2404.00781,

Mohamed Elsayed and A Rupam Mahmood. Addressing loss of plasticity and catastrophic forgetting in continual learning.arXiv preprint arXiv:2404.00781,

work page arXiv

[7] [7]

10 Preprint. Under review as a conference paper at ICLR 2026 Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip HS Torr, and Bernard Ghanem. Real-time evaluation in online continual learning: A new hope. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11888–11897,

work page 2026

[8] [8]

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks.arXiv preprint arXiv:1312.6211,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415,

[10]

Plasticity Loss in Deep Reinforcement Learning: A Survey

Timo Klein, Lukas Miklautz, Kevin Sidak, Claudia Plant, and Sebastian Tschiatschek. Plasticity loss in deep reinforcement learning: A survey. arXiv preprint arXiv:2411.04832,

[11]

Implicit under-parameterization inhibits data-efficient deep reinforcement learning

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. arXiv preprint arXiv:2010.14498,

[12]

Maintaining plasticity in continual learning via regenerative regularization

Saurabh Kumar, Henrik Marklund, and Benjamin Van Roy. Maintaining plasticity in continual learning via regenerative regularization. arXiv preprint arXiv:2308.11958,

[13]

Directions of curvature as an explanation for loss of plasticity

Alex Lewandowski, Haruto Tanaka, Dale Schuurmans, and Marlos C Machado. Directions of curvature as an explanation for loss of plasticity. arXiv preprint arXiv:2312.00246,

[14]

Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning

Jiashun Liu, Zihao Wu, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, and Ling Pan. Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning. arXiv preprint arXiv:2505.24061,

[15]

Understanding and preventing capacity loss in reinforcement learning

Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement learning. arXiv preprint arXiv:2204.09560,

[16]

Disentangling the Causes of Plasticity Loss in Neural Networks, February 2024

Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, and Will Dabney. Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762,

[17]

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,

[18]

Overestimation, overfitting, and plasticity in actor-critic: the bitter lesson of reinforcement learning

Michal Nauman, Michał Bortkiewicz, Piotr Miłoś, Tomasz Trzciński, Mateusz Ostaszewski, and Marek Cygan. Overestimation, overfitting, and plasticity in actor-critic: the bitter lesson of reinforcement learning. arXiv preprint arXiv:2403.00514,

[19]

Continual normalization: Rethinking batch normalization for online continual learning

PMLR. URL https://proceedings.mlr.press/v28/pascanu13.html. Quang Pham, Chenghao Liu, and Steven Hoi. Continual normalization: Rethinking batch normalization for online continual learning. arXiv preprint arXiv:2203.16102,

[20]

Online continual learning without the storage constraint

Ameya Prabhu, Zhipeng Cai, Puneet Dokania, Philip Torr, Vladlen Koltun, and Ozan Sener. Online continual learning without the storage constraint. arXiv preprint arXiv:2305.09253,

[21]

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941,

[22]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, doi: 10.1007/s11263-015-0816-y.

[23]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[24]

Balancing expressivity and robustness: Constrained rational activations for reinforcement learning

Rafał Surdej, Michał Bortkiewicz, Alex Lewandowski, Mateusz Ostaszewski, and Clare Lyle. Balancing expressivity and robustness: Constrained rational activations for reinforcement learning. arXiv preprint arXiv:2507.14736,

[25]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulao, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032,

[26]

Empirical Evaluation of Rectified Activations in Convolutional Network

Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853,

[27]

A CHARACTERIZATION OF ACTIVATION FUNCTION PROPERTIES

Activation                       HDZ  NZG  Sat±  Sat−  C¹   NonM  SelfN  L/R slp  f′′
ReLU Nair & Hinton (2010)        ✓    –    –     ✓     –    –     –      –        –
LeakyReLU Maas et al. (2013)     –    ✓    –     –     –    –     –      –        –
PReLU He et al. (2015)           –    ✓    –     –     –    –     –      ✓        –
RReLU Xu et al. (2015)           –    ✓    –     –     –    –     –      ✓        –
Sigmoid                          –    ✓*   ✓     ✓     ✓    –     –      –        ✓
Tanh                             –    ✓*   ✓     ✓     ...

[28]

For continual RL (Sec

All architectures end with a linear classifier whose output size is 10 for Permuted MNIST, Random Label MNIST, and Random Label CIFAR; 100 for 5+1 CIFAR; and 2 for Continual ImageNet. For continual RL (Sec. 7), policy and value functions share a multi-head MLP designed for sequential adaptation: a shared backbone with two hidden layers of 256 units is trained...

[29]

CReLU (+half)

with highly expressive forms (e.g., Rational activations Delfosse et al. (2020)) also lagged behind our first-principles designs here. This suggests that we have not yet reconciled expressivity with robust automation. A promising direction is to rethink where and on what timescale hyperparameters are adapted—potentially decoupling their updates from the m...

[30]

On the other hand, Rational Activations are defined by the numerator polynomial P(x), the denominator polynomial Q(x), the rational version (V) used (e.g., A, B, C, or D), and the function (Af) that it is trying to approximate (e.g., ReLU, Swish, etc.).

B.4 I.I.D. VS CLASS-INCREMENTAL CONTINUAL LEARNING COMPARISON HYPERPARAMETERS

Following the grid search over the hyper-p...
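
As a rough illustration of the rational-activation definition above (a ratio P(x)/Q(x) of two polynomials), here is a minimal pure-Python sketch. The denominator form Q(x) = 1 + |b_1 x + ... + b_n x^n| is an assumption chosen so that the division is always defined; the text does not state which of versions A to D uses it. The function name `rational_activation` and the coefficients are hypothetical.

```python
from typing import Sequence


def rational_activation(x: float,
                        p_coeffs: Sequence[float],
                        q_coeffs: Sequence[float]) -> float:
    """Evaluate P(x) / Q(x) with a denominator that cannot reach zero.

    p_coeffs = (a_0, ..., a_m) gives P(x) = sum_i a_i * x**i.
    q_coeffs = (b_1, ..., b_n) gives Q(x) = 1 + |sum_j b_j * x**j|
    (assumed "safe" form, not taken from the paper).
    """
    numerator = sum(a * x ** i for i, a in enumerate(p_coeffs))
    denominator = 1.0 + abs(sum(b * x ** (j + 1) for j, b in enumerate(q_coeffs)))
    return numerator / denominator


# Hypothetical coefficients: P(x) = x and Q(x) = 1 + |x|, i.e. softsign.
print(rational_activation(1.0, (0.0, 1.0), (1.0,)))   # 0.5
print(rational_activation(-3.0, (0.0, 1.0), (1.0,)))  # -0.75
```

In a trained rational activation the coefficients would be learnable parameters initialized to approximate the target function (Af); here they are fixed constants for readability.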

[31]

B.6 CURVATURE METRICS

In order to study the properties of the curvature of a neural network and how it affects the loss of plasticity, we need to work with the Hessian matrix. For a loss function L(θ) (where θ represents all the parameters of the network), the Hessian matrix H is defined as: H = ∇²L(θ). This is a symmetric matrix that captures the second-order...
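
To make the definition H = ∇²L(θ) concrete, the following pure-Python sketch approximates the Hessian of a small loss by central finite differences. It is an illustration under assumptions, not the paper's curvature pipeline: the helper `numerical_hessian`, the step size `eps`, and the toy quadratic loss are all hypothetical, and real networks would use automatic differentiation rather than finite differences.

```python
from typing import Callable, List


def numerical_hessian(loss: Callable[[List[float]], float],
                      theta: List[float],
                      eps: float = 1e-4) -> List[List[float]]:
    """Approximate H[i][j] = d^2 L / (d theta_i d theta_j) by central differences."""
    n = len(theta)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            def shifted(di: float, dj: float) -> float:
                # Perturb coordinates i and j (the same coordinate when i == j).
                point = list(theta)
                point[i] += di
                point[j] += dj
                return loss(point)

            value = (shifted(eps, eps) - shifted(eps, -eps)
                     - shifted(-eps, eps) + shifted(-eps, -eps)) / (4.0 * eps * eps)
            # Fill both triangles: the Hessian of a smooth loss is symmetric.
            hessian[i][j] = value
            hessian[j][i] = value
    return hessian


def toy_loss(theta: List[float]) -> float:
    # L(theta) = theta_0^2 + 3*theta_0*theta_1 + 2*theta_1^2, exact Hessian [[2, 3], [3, 4]].
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1] + 2.0 * theta[1] ** 2


H = numerical_hessian(toy_loss, [0.5, -1.0])
```

For the toy quadratic, `H` should be close to `[[2, 3], [3, 4]]` regardless of the evaluation point, since a quadratic has constant curvature.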

[32]

expanding

mod |Γ| each time a shock epoch occurs. Thus every Cl epochs we devote exactly one epoch to a scale-shock whose value alternates 1→1.5→1→0.5→1→0.25→1→2.00→1→.... The multiplicative factor is applied after all layers and before its non-linearity; all other epochs run with γ=1.

D.2 DERIVATIVE-FLOOR RUL...
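
The scale-shock schedule described above (one shock epoch per cycle, with the shock value cycling through 1.5, 0.5, 0.25, 2.0 and γ=1 everywhere else) can be sketched as follows. This is a minimal sketch under assumptions: the function name `shock_schedule` and the parameter `cycle_len` are hypothetical, and placing the shock on the last epoch of each cycle is a choice made here, since the text only says that one epoch per cycle is a shock.

```python
from typing import List, Sequence


def shock_schedule(num_epochs: int,
                   cycle_len: int,
                   gammas: Sequence[float] = (1.5, 0.5, 0.25, 2.0)) -> List[float]:
    """Return the scale factor gamma used in each epoch (1-indexed internally)."""
    schedule: List[float] = []
    shock_index = 0
    for epoch in range(1, num_epochs + 1):
        if epoch % cycle_len == 0:
            # Shock epoch: take the next value, wrapping around with mod |Gamma|.
            schedule.append(gammas[shock_index % len(gammas)])
            shock_index += 1
        else:
            # Regular epoch: pre-activations are left unscaled.
            schedule.append(1.0)
    return schedule


print(shock_schedule(num_epochs=8, cycle_len=2))
# [1.0, 1.5, 1.0, 0.5, 1.0, 0.25, 1.0, 2.0]
```

Inside a training loop, the returned factor would multiply a layer's pre-activation before the non-linearity is applied, matching the placement stated in the text.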

[33]

leaky–family

After 400 epochs, a new task arrives with an independent random labeling; we run 50 tasks in sequence. Inputs are identical across tasks; only concepts change, directly probing plasticity versus interference. Random Label CIFAR (concept shift). Identical protocol to Random Label MNIST, but using images drawn from CIFAR–10. We again use a fixed subset of 1,20...
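
The random-label protocol above keeps the inputs fixed and redraws the labels independently for every task. A minimal sketch of generating such a task sequence is shown below; the function name `random_label_tasks`, the seed handling, and the example subset size are assumptions for illustration (the subset size is left as a parameter because the text is truncated), while 50 tasks and 10 classes follow the values stated in the text.

```python
import random
from typing import List


def random_label_tasks(num_examples: int,
                       num_tasks: int = 50,
                       num_classes: int = 10,
                       seed: int = 0) -> List[List[int]]:
    """Return one independent random labeling of the same inputs per task."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_classes) for _ in range(num_examples)]
        for _ in range(num_tasks)
    ]


# Hypothetical subset size of 100 examples, used only for this demonstration.
tasks = random_label_tasks(num_examples=100)
```

Each inner list would replace the targets for one task while the input images stay unchanged, so any drop in the ability to fit later labelings reflects lost plasticity rather than a change in the input distribution.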

[34]

Nonetheless, this tends to hold across Leaky-ReLU, PReLU, Smooth-Leaky, and Rand

However, this is not a rule, and the optimal ‘Goldilocks zone’ might vary between activations and settings. Nonetheless, this tends to hold across Leaky-ReLU, PReLU, Smooth-Leaky, and Rand. Smooth-Leaky, and even for RReLU when considering the average of its bounds (important because that average initializes the effective leak). For these activations, we also ...

[35]

can the agent still perform well after repeated shifts on the data it now collects?

PReLU’s α indicates the initial parameter value. Smooth-Leaky triplets indicate c, p, α, while Rand. Smooth-Leaky indicates c, p, and bounds [l, u]. The tuple of values from Rational indicates Version, ((P), (Q)), Function Approx., where (P) and (Q) are the degrees of the numerator and denominator polynomials, respectively.

Activation Mean±95% CI Plasticity Score...
