Activation Function Design Sustains Plasticity in Continual Learning
Pith reviewed 2026-05-18 12:56 UTC · model grok-4.3
The pith
Activation function choice mitigates loss of plasticity in continual learning without added capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: supervised class-incremental benchmarks and reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change.
What carries the argument
Property-level analysis of negative-branch shape and saturation behavior, which determines adaptation under distribution and dynamics shifts.
If this is right
- Models retain greater ability to learn new classes sequentially without extra capacity when using the proposed activations.
- Plasticity persists longer under controlled distribution and dynamics shifts in reinforcement learning tasks.
- A lightweight diagnostic protocol can identify which activation shapes support adaptation before full training.
- The benefit appears across different model architectures, reducing the need for task-specific redesign.
Where Pith is reading between the lines
- The same negative-branch principles might improve robustness in other non-stationary regimes such as online or lifelong learning.
- Combining these activations with existing regularization or replay methods could compound gains in plasticity preservation.
- Parameterizing the negative-branch shape more flexibly could yield further activation variants tuned to specific shift types.
Load-bearing premise
The shape of the negative branch and the saturation behavior of an activation function directly determine how well a model adapts when data distributions or environment dynamics change.
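To make the premise concrete, the sketch below evaluates the derivative of a few standard activations at increasingly negative inputs. It uses only textbook definitions of ReLU, Leaky ReLU (with an assumed slope of 0.01), ELU (with unit scale), and tanh. The paper's Smooth-Leaky variants are not reproduced, because their equations are not given in this summary; the function names are illustrative.

```python
import math


def d_relu(x):
    # ReLU has zero gradient everywhere on the negative branch.
    return 1.0 if x > 0 else 0.0


def d_leaky_relu(x, alpha=0.01):
    # Leaky ReLU keeps a constant, non-zero gradient floor (alpha is assumed).
    return 1.0 if x > 0 else alpha


def d_elu(x):
    # ELU with unit scale: the negative-branch gradient exp(x) decays toward 0.
    return 1.0 if x > 0 else math.exp(x)


def d_tanh(x):
    # tanh saturates on both sides; its gradient is 1 - tanh(x)^2.
    return 1.0 - math.tanh(x) ** 2


DERIVATIVES = {
    "relu": d_relu,
    "leaky_relu": d_leaky_relu,
    "elu": d_elu,
    "tanh": d_tanh,
}


def negative_branch_profile(xs=(-0.5, -2.0, -8.0)):
    """Return each activation's gradient at the given negative inputs."""
    return {name: [d(x) for x in xs] for name, d in DERIVATIVES.items()}


if __name__ == "__main__":
    for name, grads in negative_branch_profile().items():
        print(name, ["%.5f" % g for g in grads])
```

Under these definitions only Leaky ReLU retains a gradient that does not shrink as the pre-activation becomes more negative, which is the kind of property the premise ties to adaptation.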
What would settle it
If the new activations produce no measurable improvement in adaptation speed or final performance relative to ReLU across the class-incremental and non-stationary MuJoCo benchmarks under identical training conditions, the claim would be falsified.
Original abstract
In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity) and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that activation function choice serves as a primary, architecture-agnostic lever for mitigating loss of plasticity in continual learning. Through a property-level analysis of negative-branch shape and saturation behavior, the authors propose Smooth-Leaky and Randomized Smooth-Leaky activations. These are evaluated in supervised class-incremental learning benchmarks and reinforcement learning tasks with non-stationary MuJoCo environments that induce distribution and dynamics shifts. A stress protocol and diagnostics are provided to link activation shape to adaptation under change.
Significance. If substantiated, the finding that thoughtful activation design can sustain plasticity without extra capacity or task-specific tuning would be of high significance for continual learning research. It provides a lightweight, domain-general approach applicable to both supervised and RL settings. The introduction of a stress protocol is a positive contribution for future diagnostics.
major comments (3)
- Abstract: The assertion that activation choice is a 'primary' lever lacks supporting comparisons to established methods for mitigating plasticity loss, such as regularization techniques or experience replay, making it difficult to gauge its relative importance.
- Section 3 (Property-level analysis): The analysis of negative-branch shape and saturation does not isolate these properties as the causal factors. The proposed Smooth-Leaky and Randomized Smooth-Leaky activations differ from ReLU and LeakyReLU in smoothness, randomization, and functional form simultaneously. Without an ablation that varies only the negative-branch shape while holding other properties constant, the experiments cannot establish the mechanism as load-bearing rather than a correlated side effect.
- Section 4 (Experiments): The evaluations in class-incremental benchmarks and non-stationary MuJoCo environments are described, but the manuscript should include quantitative results with error bars, statistical significance tests, and exclusion criteria to allow verification of the reported improvements.
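The ablation requested in the second major comment could be set up along the following lines. This is a hypothetical one-parameter family, not the paper's Smooth-Leaky definition: it stays continuously differentiable at zero for every value of `alpha`, so smoothness is held fixed while only the asymptotic slope of the negative branch changes. The names `leak_family`, `leak_family_grad`, and `ablation_grid` are assumptions made for illustration.

```python
import math


def leak_family(x, alpha):
    """Hypothetical C1 activation used only to illustrate an ablation.

    For x >= 0 it is the identity. For x < 0 it blends a linear leak
    (weight alpha) with an ELU-style saturating branch (weight 1 - alpha).
    """
    if x >= 0:
        return x
    return alpha * x + (1.0 - alpha) * (math.exp(x) - 1.0)


def leak_family_grad(x, alpha):
    # The gradient equals 1 at x = 0 from both sides for any alpha,
    # and tends to alpha as x goes to minus infinity.
    if x >= 0:
        return 1.0
    return alpha + (1.0 - alpha) * math.exp(x)


def ablation_grid(alphas=(0.0, 0.1, 0.3, 1.0), probe=-10.0):
    """Map each alpha to the gradient at a strongly negative probe input."""
    return {a: leak_family_grad(probe, a) for a in alphas}


if __name__ == "__main__":
    for alpha, grad in ablation_grid().items():
        print("alpha=%.1f  gradient at -10: %.5f" % (alpha, grad))
```

Sweeping `alpha` from 0 (fully saturating) to 1 (identity) while keeping every other training setting fixed would isolate the negative-branch slope from smoothness and randomization.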
minor comments (2)
- Notation: Ensure consistent definition of the new activation functions, perhaps with explicit equations for Smooth-Leaky and Randomized Smooth-Leaky.
- Figures: Clarify the visualization of activation shapes and how they relate to the diagnostics in the stress protocol.
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we intend to implement.
- Referee: Abstract: The assertion that activation choice is a 'primary' lever lacks supporting comparisons to established methods for mitigating plasticity loss, such as regularization techniques or experience replay, making it difficult to gauge its relative importance.
  Authors: We agree that the claim of activation choice serving as a 'primary' lever would be strengthened by explicit comparisons to other established techniques. While the manuscript emphasizes the lightweight and architecture-agnostic advantages of this approach, we will add direct comparisons against regularization methods (e.g., EWC) and experience replay in the revised experimental sections for both the supervised and RL settings to better situate the relative contribution.
  Revision: yes
- Referee: Section 3 (Property-level analysis): The analysis of negative-branch shape and saturation does not isolate these properties as the causal factors. The proposed Smooth-Leaky and Randomized Smooth-Leaky activations differ from ReLU and LeakyReLU in smoothness, randomization, and functional form simultaneously. Without an ablation that varies only the negative-branch shape while holding other properties constant, the experiments cannot establish the mechanism as load-bearing rather than a correlated side effect.
  Authors: We appreciate this observation on the need for tighter causal isolation. The property-level analysis was intended to motivate design choices targeting negative-branch behavior and saturation to support adaptation under non-stationarity. To address the concern that multiple factors change at once, we will incorporate additional ablation experiments in the revised manuscript that hold smoothness and randomization fixed while systematically varying only the negative-branch shape, thereby clarifying whether this property is the primary driver.
  Revision: yes
- Referee: Section 4 (Experiments): The evaluations in class-incremental benchmarks and non-stationary MuJoCo environments are described, but the manuscript should include quantitative results with error bars, statistical significance tests, and exclusion criteria to allow verification of the reported improvements.
  Authors: We concur that enhanced statistical reporting will improve verifiability. In the revised manuscript we will report means and standard deviations (error bars) across multiple independent runs, include appropriate statistical significance tests (such as paired t-tests with p-values), and explicitly document any exclusion criteria applied to the presented results.
  Revision: yes
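The statistical reporting promised in the last response can be sketched as follows, assuming each activation is run with the same seeds so that scores can be paired. The helper names are illustrative and the numbers in the usage section are placeholders, not results from the paper.

```python
import math
import statistics


def mean_and_std(values):
    """Sample mean and sample standard deviation across independent runs."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder scores for five matched seeds (NOT taken from the paper).
    candidate = [0.71, 0.74, 0.69, 0.73, 0.72]
    baseline = [0.66, 0.70, 0.67, 0.69, 0.68]
    print("candidate mean/std:", mean_and_std(candidate))
    print("baseline mean/std:", mean_and_std(baseline))
    print("paired t, df:", paired_t_statistic(candidate, baseline))
```

A p-value would then be read from the t distribution with the returned degrees of freedom; where SciPy is available, `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call.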
Circularity Check
No circularity: empirical claims rest on benchmark evaluations without self-referential derivations or fitted predictions
full rationale
The manuscript introduces Smooth-Leaky and Randomized Smooth-Leaky activations after a property-level analysis of negative-branch shape and saturation, then reports results on class-incremental and non-stationary MuJoCo benchmarks. No equations, uniqueness theorems, or parameter-fitting steps are described that reduce by construction to author-defined inputs or prior self-citations. The central claims are supported by direct experimental comparison rather than any derivation chain that loops back to its own premises, satisfying the criteria for a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Negative-branch shape and saturation behavior of activations influence adaptation under change in continual learning.
invented entities (2)
- Smooth-Leaky activation: no independent evidence
- Randomized Smooth-Leaky activation: no independent evidence
Forward citations
Cited by 2 Pith papers
- Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
  TeLAPA maintains archives of behaviorally diverse yet competent policies aligned in a shared latent space to preserve plasticity and enable faster recovery after interference in continual reinforcement learning.
- On the Stability of Growth in Structural Plasticity
  Newborn units in growing neural networks are forward-active but backward-starved, receiving weaker gradients than existing units and creating integration challenges that make growth less reliable than pruning in compl...
Reference graph
Works this paper leans on
-
[1]
Continuously Differentiable Exponential Linear Units
Jonathan T Barron. Continuously differentiable exponential linear units. arXiv preprint arXiv:1704.07483,
-
[2]
A study on the plasticity of neural networks
Tudor Berariu, Wojciech Czarnecki, Soham De, Jorg Bornschein, Samuel Smith, Razvan Pascanu, and Claudia Clopath. A study on the plasticity of neural networks. arXiv preprint arXiv:2106.00042,
-
[3]
Evolutionary optimization of deep learning activation functions
Garrett Bingham, William Macke, and Risto Miikkulainen. Evolutionary optimization of deep learning activation functions. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pp. 289–296,
-
[4]
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 4(5):11,
-
[5]
Recurrent rational networks
Quentin Delfosse, Patrick Schramowski, Alejandro Molina, and Kristian Kersting. Recurrent rational networks. arXiv preprint arXiv:2102.09407, 2021a.
Quentin Delfosse, Patrick Schramowski, Martin Mundt, Alejandro Molina, and Kristian Kersting. Adaptive rational activations to boost deep reinforcement learning. arXiv preprint arXiv:2102.09407, 2021b.
Shibhans...
-
[6]
Mohamed Elsayed and A Rupam Mahmood. Addressing loss of plasticity and catastrophic forgetting in continual learning. arXiv preprint arXiv:2404.00781,
-
[7]
Preprint. Under review as a conference paper at ICLR 2026.
Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip HS Torr, and Bernard Ghanem. Real-time evaluation in online continual learning: A new hope. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11888–11897,
-
[8]
An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211,
-
[9]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415,
-
[10]
Plasticity Loss in Deep Reinforcement Learning: A Survey
Timo Klein, Lukas Miklautz, Kevin Sidak, Claudia Plant, and Sebastian Tschiatschek. Plasticity loss in deep reinforcement learning: A survey. arXiv preprint arXiv:2411.04832,
-
[11]
Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. arXiv preprint arXiv:2010.14498,
-
[12]
Maintaining plasticity in continual learning via regenerative regularization
Saurabh Kumar, Henrik Marklund, and Benjamin Van Roy. Maintaining plasticity in continual learning via regenerative regularization. arXiv preprint arXiv:2308.11958,
-
[13]
Directions of curvature as an explanation for loss of plasticity
Alex Lewandowski, Haruto Tanaka, Dale Schuurmans, and Marlos C Machado. Directions of curvature as an explanation for loss of plasticity. arXiv preprint arXiv:2312.00246,
-
[14]
Jiashun Liu, Zihao Wu, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, and Ling Pan. Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning. arXiv preprint arXiv:2505.24061,
-
[15]
Understanding and preventing capacity loss in reinforcement learning
Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement learning. arXiv preprint arXiv:2204.09560,
-
[16]
Disentangling the Causes of Plasticity Loss in Neural Networks, February 2024
Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, and Will Dabney. Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762,
-
[17]
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,
-
[18]
Michal Nauman, Michał Bortkiewicz, Piotr Miłoś, Tomasz Trzciński, Mateusz Ostaszewski, and Marek Cygan. Overestimation, overfitting, and plasticity in actor-critic: the bitter lesson of reinforcement learning. arXiv preprint arXiv:2403.00514,
-
[19]
Continual normalization: Rethinking batch normalization for online continual learning
Quang Pham, Chenghao Liu, and Steven Hoi. Continual normalization: Rethinking batch normalization for online continual learning. arXiv preprint arXiv:2203.16102,
-
[20]
Online continual learning without the storage constraint
Ameya Prabhu, Zhipeng Cai, Puneet Dokania, Philip Torr, Vladlen Koltun, and Ozan Sener. Online continual learning without the storage constraint. arXiv preprint arXiv:2305.09253,
-
[21]
Searching for Activation Functions
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941,
-
[22]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, doi: 10.1007/s11263-015-0816-y.
-
[23]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[24]
Rafał Surdej, Michał Bortkiewicz, Alex Lewandowski, Mateusz Ostaszewski, and Clare Lyle. Balancing expressivity and robustness: Constrained rational activations for reinforcement learning. arXiv preprint arXiv:2507.14736,
-
[25]
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulao, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032,
-
[26]
Empirical Evaluation of Rectified Activations in Convolutional Network
Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853,
-
[27]
A CHARACTERIZATION OF ACTIVATION FUNCTION PROPERTIES
Activation                       HDZ  NZG  Sat±  Sat−  C1  NonM  SelfN  L/R slp  f′′
ReLU Nair & Hinton (2010)        ✓    –    –     ✓     –   –     –      –        –
LeakyReLU Maas et al. (2013)     –    ✓    –     –     –   –     –      –        –
PReLU He et al. (2015)           –    ✓    –     –     –   –     –      ✓        –
RReLU Xu et al. (2015)           –    ✓    –     –     –   –     –      ✓        –
Sigmoid                          –    ✓*   ✓     ✓     ✓   –     –      –        ✓
Tanh                             –    ✓*   ✓     ✓     ...
-
[28]
All architectures end with a linear classifier whose output size is 10 for Permuted MNIST, Random Label MNIST, and Random Label CIFAR; 100 for 5+1 CIFAR; and 2 for Continual ImageNet. For continual RL (Sec. 7), policy and value functions share a multi-head MLP designed for sequential adaptation: a shared backbone with two hidden layers of 256 units is trained...
-
[29]
with highly expressive forms (e.g., Rational activations Delfosse et al. (2020)) also lagged behind our first-principles designs here. This suggests that we have not yet reconciled expressivity with robust automation. A promising direction is to rethink where and on what timescale hyperparameters are adapted—potentially decoupling their updates from the m...
-
[30]
On the other hand, Rational Activations are defined by the polynomial of the numerator P(x), the polynomial of the denominator Q(x), the rational version (V) used (e.g., A, B, C, or D), and the function (Af) that they are trying to approximate (e.g., ReLU, Swish, etc.).
B.4 I.I.D. VS CLASS-INCREMENTAL CONTINUAL LEARNING COMPARISON HYPERPARAMETERS
Following the grid search over the hyper-p...
-
[31]
B.6 CURVATURE METRICS
In order to study the properties of the curvature of a neural network and how it affects the loss of plasticity, we need to work with the Hessian matrix. For a loss function L(θ) (where θ represents all the parameters of the network), the Hessian matrix H is defined as: H = ∇²L(θ). This is a symmetric matrix that captures the second-order...
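The Hessian defined in the excerpt above has entries H[i][j] = ∂²L/∂θ_i∂θ_j. As a minimal sketch, the code below approximates it with central finite differences on a toy two-parameter loss. The toy loss, the step size, and the function names are assumptions for illustration; the excerpt is truncated before it states which curvature metrics the paper derives from H, so none are computed here.

```python
def numerical_hessian(loss, theta, h=1e-3):
    """Approximate the Hessian of loss at theta with central differences."""
    n = len(theta)

    def shifted(i, sign_i, j, sign_j):
        # Evaluate the loss after moving coordinates i and j by +/- h.
        point = list(theta)
        point[i] += sign_i * h
        point[j] += sign_j * h
        return loss(point)

    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hessian[i][j] = (
                shifted(i, 1, j, 1)
                - shifted(i, 1, j, -1)
                - shifted(i, -1, j, 1)
                + shifted(i, -1, j, -1)
            ) / (4.0 * h * h)
    return hessian


def toy_loss(theta):
    # Toy quadratic (an assumption): its exact Hessian is [[2, 3], [3, 4]].
    a, b = theta
    return a * a + 3.0 * a * b + 2.0 * b * b


if __name__ == "__main__":
    for row in numerical_hessian(toy_loss, [0.5, -1.0]):
        print(["%.4f" % value for value in row])
```

For real networks the Hessian is far too large to form this way, and Hessian-vector products from automatic differentiation are the usual substitute; the finite-difference version is only meant to make the definition tangible.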
-
[32]
mod |Γ| each time a shock epoch occurs. Thus every Cl epochs we devote exactly one epoch to a scale-shock whose value alternates 1 → 1.5 → 1 → 0.5 → 1 → 0.25 → 1 → 2.00 → 1 → ... The multiplicative factor is applied after all layers and before its non-linearity; all other epochs run with γ = 1.
D.2 DERIVATIVE-FLOOR RUL...
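The scale-shock schedule in the excerpt above can be sketched as a per-epoch list of multipliers. Two points are assumptions, since the excerpt is truncated: the shock set Γ is read off the alternating sequence as (1.5, 0.5, 0.25, 2.0), and the shock is placed on the last epoch of each block of `shock_period` epochs. The function name is illustrative.

```python
# Assumed shock set, read off the sequence 1 -> 1.5 -> 1 -> 0.5 -> 1 -> 0.25 -> 1 -> 2.0.
GAMMA_CYCLE = (1.5, 0.5, 0.25, 2.0)


def scale_schedule(num_epochs, shock_period, cycle=GAMMA_CYCLE):
    """Return the multiplier gamma for each epoch (1-indexed internally).

    Every shock_period-th epoch is a shock epoch whose gamma is taken from
    cycle, advancing the index modulo len(cycle); other epochs use gamma = 1.
    """
    if shock_period < 1:
        raise ValueError("shock_period must be a positive integer")
    schedule = []
    shock_index = 0
    for epoch in range(1, num_epochs + 1):
        if epoch % shock_period == 0:
            schedule.append(cycle[shock_index % len(cycle)])
            shock_index += 1
        else:
            schedule.append(1.0)
    return schedule


if __name__ == "__main__":
    print(scale_schedule(num_epochs=10, shock_period=2))
```

During a shock epoch the pre-activation of a layer would be multiplied by the scheduled gamma before the non-linearity is applied, as the excerpt describes.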
-
[33]
After 400 epochs, a new task arrives with an independent random labeling; we run 50 tasks in sequence. Inputs are identical across tasks; only concepts change, directly probing plasticity versus interference. Random Label CIFAR (concept shift). Identical protocol to Random Label MNIST, but using images drawn from CIFAR-10. We again use a fixed subset of 1,20...
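The random-label protocol in the excerpt above keeps the inputs fixed and redraws the labels for every task. A minimal sketch of the label generation follows. The subset size is truncated in the excerpt, so it is left as a parameter; the function name and the seeding scheme are assumptions.

```python
import random


def random_label_tasks(num_examples, num_classes, num_tasks, seed=0):
    """Draw an independent uniform labeling of the same inputs for each task."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    return [
        [rng.randrange(num_classes) for _ in range(num_examples)]
        for _ in range(num_tasks)
    ]


if __name__ == "__main__":
    # 50 tasks and 10 classes follow the excerpt; 100 examples is a placeholder.
    tasks = random_label_tasks(num_examples=100, num_classes=10, num_tasks=50)
    print(len(tasks), "tasks; first labels of task 0:", tasks[0][:10])
```

In the protocol as described, each labeling would be trained on for 400 epochs before the next one arrives, so any drop in fit speed across tasks reflects lost plasticity rather than changed inputs.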
-
[34]
However, this is not a rule, and the optimal ‘Goldilocks zone’ might vary between activations and settings. Nonetheless, this tends to hold across Leaky-ReLU, PReLU, Smooth-Leaky, and Rand. Smooth-Leaky, and even for RReLU when considering the average of its bounds (important because that average initializes the effective leak). For these activations, we also ...
-
[35]
can the agent still perform well after repeated shifts on the data it now collects?
PReLU’s α indicates the initial parameter value. Smooth-Leaky triplets indicate c, p, α, while Rand. Smooth-Leaky indicates c, p, and bounds [l, u]. The tuple of values for Rational indicates Version, ((P), (Q)), Function Approx., where (P) and (Q) are the degrees of the numerator and denominator polynomials, respectively. Activation Mean±95% CI Plasticity Score...