
arxiv: 2606.09762 · v1 · pith:5TMGYSGQnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI

Preserving Plasticity in Continual Learning via Dynamical Isometry

Pith reviewed 2026-06-27 17:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual learning · plasticity · dynamical isometry · neural tangent kernel · Jacobian regularization · adaptive optimizer · ReLU reactivation

The pith

Dynamical isometry keeps neural networks able to learn new tasks by holding layer Jacobian singular values near one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that loss of plasticity during continual learning occurs when networks move away from dynamical isometry. It links this condition directly to the empirical neural tangent kernel and shows that networks can stay nearly isometric while still serving as universal approximators for nonlinear functions. An efficient regularization scheme and a decoupled optimizer are introduced to enforce the condition, and they are shown to match or exceed prior methods on benchmarks that deliberately induce plasticity collapse.

Core claim

Dynamical isometry is the mechanism that preserves plasticity because it keeps the empirical Neural Tangent Kernel well-conditioned across tasks. The authors establish this by revisiting almost-everywhere isometric networks that remain expressive approximators, by deriving a regularization term that promotes isometry and reactivates dormant ReLUs, and by constructing AdamO to apply the regularization separately from gradient steps. They further show that earlier plasticity methods only achieve partial isometry.
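
To make the idea of an isometry-promoting regularization term concrete, here is a minimal sketch using the common soft-orthogonality penalty, the squared Frobenius norm of W^T W - I. This form is an assumption: the summary above does not state the paper's exact regularizer, and the function names and the test matrix are hypothetical. For a weight matrix with at least as many rows as columns, the penalty equals the sum of (sigma_i^2 - 1)^2 over the singular values sigma_i, so it is zero exactly when every singular value is one.

```python
import numpy as np

def soft_orthogonality_penalty(W):
    """Assumed penalty: squared Frobenius norm of W^T W - I."""
    gram = W.T @ W
    return float(np.sum((gram - np.eye(W.shape[1])) ** 2))

def penalty_from_singular_values(W):
    """Same quantity written via the singular values (valid when rows >= columns)."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum((s ** 2 - 1.0) ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))   # hypothetical tall weight matrix
Q, _ = np.linalg.qr(W)            # reduced QR: Q has orthonormal columns

penalty_random = soft_orthogonality_penalty(W)
penalty_spectral = penalty_from_singular_values(W)
penalty_orthonormal = soft_orthogonality_penalty(Q)
```

The two functions agree up to floating-point error, and the penalty vanishes for the matrix with orthonormal columns, which is why a term of this kind pushes singular values toward one.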

What carries the argument

Dynamical isometry: the requirement that the singular values of each layer's Jacobian stay close to one, which keeps gradient flow and the neural tangent kernel stable under non-stationary data.
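
The definition can be checked numerically on a single layer. The sketch below is illustrative only: the helper names, the layer size, and the choice of an orthogonal weight matrix are assumptions, not the paper's setup. It computes the Jacobian of one ReLU layer with respect to its input and measures how far its singular values are from one. With orthogonal weights the linear map is exactly isometric, while every inactive ReLU unit contributes a singular value of zero.

```python
import numpy as np

def relu_layer_jacobian(W, b, x):
    """Jacobian of h = relu(W @ x + b) with respect to the input x."""
    pre = W @ x + b
    mask = (pre > 0).astype(W.dtype)  # ReLU derivative: 1 for active units, 0 otherwise
    return mask[:, None] * W          # equals diag(mask) @ W

def isometry_gap(J):
    """Largest absolute deviation of the singular values of J from one."""
    s = np.linalg.svd(J, compute_uv=False)
    return float(np.max(np.abs(s - 1.0)))

rng = np.random.default_rng(0)
n = 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal weight matrix
target = np.array([1.0, -1.0] * (n // 2))         # desired pre-activations
x = Q.T @ target                                  # chosen so that Q @ x == target
b = np.zeros(n)

gap_linear = isometry_gap(Q)                      # linear part alone
J_relu = relu_layer_jacobian(Q, b, x)
gap_relu = isometry_gap(J_relu)                   # half of the units are inactive
singular_values_relu = np.linalg.svd(J_relu, compute_uv=False)
```

Here half of the pre-activations are negative by construction, so the layer Jacobian has four singular values at one and four at zero. This is the kind of departure from isometry that dormant units cause.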

If this is right

  • Near-dynamical isometry remains compatible with universal approximation by nonlinear networks.
  • The proposed regularizer reactivates dormant ReLU units as a side effect of promoting isometry.
  • AdamO applies isometry regularization without interfering with the base gradient updates.
  • Prior plasticity methods achieve only a partial form of isometry.
  • The same regularization and optimizer yield consistent gains on both supervised and reinforcement-learning continual benchmarks.
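
The decoupling described in the list above can be sketched by analogy with AdamW. The code below is not the paper's AdamO: the update rule, the penalty form R(W) = 0.25 * ||W^T W - I||_F^2, the `lr * lam` scaling, and all names are assumptions made for illustration. The point it shows is structural: the task gradient passes through Adam's moment estimates, while the isometry gradient is applied as a separate step that never touches them.

```python
import numpy as np

def orth_penalty(W):
    """Assumed penalty R(W) = 0.25 * ||W^T W - I||_F^2."""
    return 0.25 * float(np.sum((W.T @ W - np.eye(W.shape[1])) ** 2))

def orth_penalty_grad(W):
    """Gradient of orth_penalty with respect to W."""
    return W @ (W.T @ W - np.eye(W.shape[1]))

def adamo_like_step(W, grad, state, lr=1e-3, lam=1e-3,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on the task gradient, then a decoupled isometry step."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    # Decoupled part: the penalty gradient never enters the moments m and v.
    return W - lr * lam * orth_penalty_grad(W)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
W = Q + 0.05 * rng.standard_normal((8, 8))   # slightly non-orthogonal start
state = {"t": 0, "m": np.zeros_like(W), "v": np.zeros_like(W)}

penalty_before = orth_penalty(W)
for _ in range(300):
    # A zero task gradient isolates the effect of the decoupled step.
    W = adamo_like_step(W, np.zeros_like(W), state, lr=0.1, lam=1.0)
penalty_after = orth_penalty(W)
```

With the task gradient set to zero, the Adam part leaves the weights unchanged and the decoupled step alone drives the penalty toward zero. The step sizes in this toy run are chosen for fast convergence and are unrelated to the hyperparameters reported for the experiments.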

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If dynamical isometry is the dominant factor, then replay buffers and architectural resets become secondary rather than essential for long-horizon continual learning.
  • The same Jacobian-regularization idea could be tested in non-stationary control problems where the data distribution drifts continuously rather than in discrete tasks.
  • Measuring the full spectrum of Jacobian singular values rather than only their mean or condition number might give a stricter test of the isometry hypothesis.
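
The last point can be illustrated with a small diagnostic. The sketch below is a hypothetical helper, not something taken from the paper, and the tolerance `tol` is an arbitrary choice. It reports the mean and condition number alongside two whole-spectrum quantities, and the two toy Jacobians show how each summary statistic alone can miss a departure from isometry.

```python
import numpy as np

def spectrum_report(J, tol=0.1):
    """Summary and whole-spectrum statistics of the singular values of J."""
    s = np.linalg.svd(J, compute_uv=False)
    return {
        "mean": float(s.mean()),
        "condition_number": float(s.max() / max(s.min(), 1e-12)),
        "max_deviation": float(np.max(np.abs(s - 1.0))),
        "fraction_within_tol": float(np.mean(np.abs(s - 1.0) <= tol)),
    }

isometric = np.eye(4)
skewed = np.diag([0.5, 0.5, 1.5, 1.5])  # mean singular value is 1, yet far from isometric
scaled = 2.0 * np.eye(4)                # condition number is 1, yet far from isometric

report_isometric = spectrum_report(isometric)
report_skewed = spectrum_report(skewed)
report_scaled = spectrum_report(scaled)
```

The skewed matrix passes a mean-only check and the scaled matrix passes a condition-number-only check, while `max_deviation` and `fraction_within_tol` flag both.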

Load-bearing premise

That the empirical Neural Tangent Kernel and layer-wise Jacobian singular values are the primary quantities that control plasticity loss.

What would settle it

A controlled run in which Jacobian singular values are forced to remain near one yet plasticity still collapses on a standard continual-learning benchmark, or in which isometry is deliberately broken yet plasticity is retained.

Figures

Figures reproduced from arXiv: 2606.09762 by Andries Rosseau, Ann Nowé, Robert Müller.

Figure 1: Top to bottom: Results for Pixel-Permutation (40 epochs/task, 1000 tasks, batch size 250), Random-Label Memorization (256 epochs/task, 200 tasks, batch size 128) and Label-Shuffling (40 epochs/task, 1000 tasks, batch size 250). Results for ReDo and L2 Init have similar noise profiles as e.g., NaP, but are smoothed for Pixel-Permutation to not clog the figure. Pixel-Permutation and Random-Label Memorizatio…
Figure 2: Continual MinAtar with random channel permutations. Results are reported over 8 random seeds. Each environment runs for 15 million steps per cycle for a total of 1.2B steps. GS is GroupSort/MaxMin. …vations, maintains consistent performance throughout the 20 cycles. Our method matches or outperforms the baselines on regular evaluation runs. (We detail explicit learning curves and additional RL diagnostics …
Figure 3: Sensitivity of AdamO to the orthogonal-regularization strength λ on the continual CIFAR-10 random-label benchmark. B.3. Additional diagnostic metrics: We provide additional diagnostics for the CIFAR-10 pixel-permutation benchmark in Figures 4, 5, and 6, focusing respectively on the weight spectra, empirical NTK statistics, and intermediate Jacobian spectra. These figures complement the performance plots in …
Figure 4: Weight-space diagnostics for CIFAR-10 pixel permutation, including singular-value statistics, effective rank, and the weight condition ratio over training.
Figure 5: Core empirical NTK diagnostics for CIFAR-10 pixel permutation, including condition number, eigenvalue spread, participation rank, and isotropy gap.
Figure 6: Intermediate-layer Jacobian diagnostics for CIFAR-10 pixel permutation, showing how the singular-value spectrum and conditioning evolve through training.
Figure 7: Dormant-unit diagnostics for the supervised continual-learning experiments. The figure shows how the number of Sokar-style dormant ReLU units evolves during training, and how the isometry-preserving methods reduce or reverse the buildup of inactive features relative to the baselines.
Figure 8: Training curves for MinAtar games. The color gradient indicates early (light) to late (dark) cycles (20 cycles).
Figure 9: Weight-spectrum diagnostics for continual MinAtar. The plotted quantities summarize the singular-value distribution of the learned weight operators over training, making visible whether layers develop strong anisotropic directions or preserve a tighter spectrum.
Figure 10: Layer-wise weight effective-rank diagnostics for continual MinAtar. These panels track how many singular directions of each layer remain meaningfully used, helping distinguish balanced representations from spectra that collapse onto a small subspace.
Figure 11: Full Jacobian diagnostics for continual MinAtar. These figures summarize the singular-value spectrum of the input-output Jacobian, directly probing dynamical isometry through quantities such as singular-value spread and conditioning.
Figure 12: Layer-wise Jacobian diagnostics for continual MinAtar. Unlike the full Jacobian view, these panels localize where along the network depth singular values drift away from one, revealing which layers are responsible for deteriorating gradient transport.
Figure 13: Empirical NTK diagnostics for continual MinAtar. These panels track kernel conditioning and rank-related quantities, indicating how isotropically parameter updates can move the represented function in output space over time.
Figure 14: Dormant-neuron diagnostics for continual MinAtar. These figures quantify inactive or weakly active units, making the connection between revival of dormant features and preserved plasticity explicit.
Figure 15: Per-game evaluation performance for continual MinAtar. Breaking the aggregate score down by environment shows whether gains come from broadly preserved plasticity across games rather than from improvements on only a small subset.
Figure 16: Training curves for MinAtar games. The color gradient indicates early (light) to late (dark) cycles (20 cycles).
Figure 17: Training curves for Octax games showing return on evaluation environments against training steps. The line style indicates the cycle within the continual experiment.
Figure 18: Weight-spectrum diagnostics for continual Octax. These panels summarize the singular-value statistics of the convolutional and linear operators, highlighting whether training preserves a balanced spectrum or develops highly anisotropic directions.
Figure 19: Layer-wise weight effective-rank diagnostics for continual Octax. Effective rank measures how broadly each layer uses its singular directions, complementing raw norm or spectral diagnostics with a notion of dimensional richness.
Figure 20: Full Jacobian diagnostics for continual Octax. These plots summarize the singular-value spectrum of the end-to-end Jacobian and therefore directly monitor whether the network stays near a dynamically isometric regime.
Figure 21: Layer-wise Jacobian diagnostics for continual Octax. By decomposing Jacobian statistics across depth, these panels identify where conditioning degrades and where isometry-preserving methods stabilize signal propagation.
Figure 22: Empirical NTK diagnostics for continual Octax. These figures track kernel condition numbers and rank-related measures, which quantify how uniformly policy/value gradients can move the represented function.
Figure 23: Dormant-neuron diagnostics for continual Octax. These panels measure inactivity and feature collapse in the network, providing an RL-side analogue of the dormant-unit behavior discussed for supervised settings.
Figure 24: Validation accuracy for the continual CIFAR-10 pixel-permutation transformer experiments at learning rate 10^-2. This plot summarizes the primary stability and performance comparison across the transformer variants considered in the rebuttal study. [Plot: ViT CIFAR-10 PP Skips, lr=1e-2; x-axis: Task; y-axis: IO Jac. Effective Rank; series: AdamO, LN, LN + WD=0.01, Vanilla] …
Figure 25: Input-output Jacobian diagnostics for the transformer experiments with skip connections at learning rate 10^-2. Left: input-output effective rank, which measures how many singular directions of the end-to-end Jacobian remain meaningfully used. Right: input-output condition number, which captures worst-case anisotropy of gradient transport. Together these panels directly probe whether the transformer remai…
Figure 26: Weight-spectrum diagnostics for the transformer experiments with skip connections. The effective-rank panel measures how broadly each layer uses its singular directions; the mean-squared singular-value panel tracks preservation of the overall weight scale; and the smallest and largest singular-value panels expose anisotropic extremes. These figures make clear whether the optimizer preserves a balanced spe…
Figure 27: Dormant-neuron diagnostics for the transformer experiments with skip connections. This plot tracks the buildup of inactive or weakly active units over training and shows whether improved conditioning is accompanied by reduced feature collapse.
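
Several of the captions above refer to effective rank and to Sokar-style dormant units. As a rough sketch of how such diagnostics are commonly computed, the code below uses the entropy-based effective rank of the singular values and a dormancy score equal to a unit's mean absolute activation divided by the layer average. Both definitions are assumptions about what the figures plot; the threshold `tau`, the function names, and the toy inputs are hypothetical.

```python
import numpy as np

def effective_rank(M):
    """exp of the Shannon entropy of the normalized singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]                       # drop numerically zero directions
    return float(np.exp(-np.sum(p * np.log(p))))

def dormant_fraction(activations, tau=0.0):
    """Fraction of units whose normalized mean |activation| is at most tau.

    activations: array of shape (batch, units), e.g. post-ReLU outputs.
    """
    score = np.abs(activations).mean(axis=0)
    score = score / (score.mean() + 1e-12)  # normalize by the layer average
    return float(np.mean(score <= tau))

rank_full = effective_rank(np.eye(4))                              # all directions used equally
rank_one = effective_rank(np.outer([1.0, 2.0, 3.0], [1.0, 1.0]))   # a single direction

acts = np.array([[1.0, 0.0, 2.0, 0.0],
                 [3.0, 0.0, 1.0, 0.0]])     # units 1 and 3 never fire
dormant = dormant_fraction(acts)
```

An identity matrix has effective rank equal to its dimension, a rank-one matrix has effective rank one, and the toy activation batch has half of its units dormant.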
Original abstract

Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Tangent Kernel, and identify dynamical isometry (the condition that layer-wise Jacobian singular values remain close to one) as a key mechanism for preserving plasticity in continual learning. We revisit a class of networks that are almost-everywhere isometric while remaining universal Lipschitz function approximators, demonstrating that near-dynamical isometry is compatible with expressive nonlinear representations. For general architectures, we propose an efficient isometry-promoting regularization scheme and identify a novel mechanism by which it can reactivate dormant ReLU units. Building on this, we introduce AdamO, an Adam-style adaptive optimizer that decouples isometry regularization from gradient updates, analogous to AdamW. We further reinterpret prior plasticity-preserving approaches through the lens of dynamical isometry, showing that they target only a partial measure of isometry. Across supervised and reinforcement-learning continual-learning benchmarks designed to induce plasticity loss, our methods consistently match or outperform existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that loss of plasticity in continual learning can be understood and mitigated through the lens of dynamical isometry (layer-wise Jacobian singular values near 1), which preserves the empirical Neural Tangent Kernel. It shows that near-isometric networks remain universal Lipschitz approximators, introduces an isometry-promoting regularizer that can reactivate dormant ReLUs, proposes the AdamO optimizer (decoupling isometry regularization from gradients, analogous to AdamW), reinterprets prior plasticity methods as targeting only partial isometry, and reports that the resulting methods match or outperform existing approaches on supervised and RL continual-learning benchmarks designed to induce plasticity loss.

Significance. If the mechanism attribution holds, the work offers a unifying perspective on plasticity preservation and practical tools (regularizer + AdamO) that could be adopted broadly. Credit is due for the explicit analogy to AdamW, the demonstration that near-dynamical isometry is compatible with expressive nonlinear representations, and the reinterpretation of prior methods through the isometry lens.

major comments (2)
  1. [Abstract] The central claim that dynamical isometry is the operative mechanism (via its effect on the eNTK) is load-bearing for the contribution, yet the text provides no ablation or control that holds feature drift, loss-landscape curvature, and other factors fixed while varying only the layer-wise Jacobian singular-value distribution. Without such isolation, the reported outperformance cannot be attributed specifically to isometry.
  2. [Abstract] The asserted relation between plasticity and the empirical Neural Tangent Kernel is stated without an explicit equation or derivation showing how the singular-value condition controls the eNTK spectrum under non-stationary data; this link is required to make the mechanism claim falsifiable rather than correlational.
minor comments (2)
  1. [Abstract] The abstract refers to 'benchmarks designed to induce plasticity loss' but supplies neither dataset names, task sequences, nor hyper-parameter details; these must be added for reproducibility.
  2. Notation for the isometry-promoting regularizer and the precise definition of 'near-dynamical isometry' (e.g., tolerance on singular values) should be introduced with an equation in the main text rather than left implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the mechanism claims. We address each major comment below and outline planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The central claim that dynamical isometry is the operative mechanism (via its effect on the eNTK) is load-bearing for the contribution, yet the text provides no ablation or control that holds feature drift, loss-landscape curvature, and other factors fixed while varying only the layer-wise Jacobian singular-value distribution. Without such isolation, the reported outperformance cannot be attributed specifically to isometry.

    Authors: We agree that a fully isolated ablation varying only the Jacobian singular-value distribution while holding feature drift and curvature fixed would provide stronger causal evidence. Such a control is difficult to construct in non-stationary continual learning because changes to the Jacobian spectrum necessarily influence representations and optimization dynamics. Our regularizer is explicitly designed to target isometry (via a penalty on deviation of singular values from 1) without directly regularizing other quantities, and we compare against baselines that do not enforce this property. In revision we will add a dedicated limitations paragraph discussing this point and explaining why complete isolation remains challenging. revision: partial

  2. Referee: [Abstract] The asserted relation between plasticity and the empirical Neural Tangent Kernel is stated without an explicit equation or derivation showing how the singular-value condition controls the eNTK spectrum under non-stationary data; this link is required to make the mechanism claim falsifiable rather than correlational.

    Authors: Section 3 of the manuscript derives the link: the eNTK is the Gram matrix of the product of layer-wise Jacobians, so that the spectrum of the eNTK remains well-conditioned precisely when the singular values of each Jacobian stay near 1. Under non-stationary data this prevents progressive ill-conditioning that correlates with plasticity loss. We will insert a concise reference to this derivation (or a one-sentence summary) into the abstract and introduction to make the mechanistic claim more explicit and falsifiable. revision: yes
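
For readers who want to see the object under discussion, the sketch below computes an empirical NTK for a toy network and reports its condition number. It uses the standard definition of the eNTK as the Gram matrix of per-example parameter gradients; it does not reproduce the derivation attributed to Section 3. The two-layer scalar-output ReLU network, its sizes, and all names are assumptions chosen for brevity.

```python
import numpy as np

def per_example_param_grad(W, a, x):
    """Gradient of f(x) = a . relu(W @ x) with respect to (W, a), flattened."""
    pre = W @ x
    hidden = np.maximum(pre, 0.0)
    grad_a = hidden                          # df/da_j = relu(pre_j)
    grad_W = np.outer(a * (pre > 0), x)      # df/dW_jk = a_j * 1[pre_j > 0] * x_k
    return np.concatenate([grad_W.ravel(), grad_a])

def empirical_ntk(W, a, X):
    """Gram matrix K_ij = <grad f(x_i), grad f(x_j)> over the rows of X."""
    G = np.stack([per_example_param_grad(W, a, x) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 5)) / np.sqrt(5)    # hypothetical hidden layer
a = rng.standard_normal(16) / np.sqrt(16)        # hypothetical output weights
X = rng.standard_normal((6, 5))                  # six toy inputs

K = empirical_ntk(W, a, X)
eigenvalues = np.linalg.eigvalsh(K)
condition_number = float(eigenvalues.max() / max(eigenvalues.min(), 1e-12))
```

A large `condition_number` means a gradient step moves the network's outputs much more along some directions than others. Tracking this quantity across tasks is one way to test the conditioning claim empirically.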

Circularity Check

0 steps flagged

No circularity: derivation remains independent of its inputs

full rationale

The paper relates plasticity loss to the empirical NTK and identifies dynamical isometry (Jacobian singular values near 1) as a mechanism, then proposes an isometry-promoting regularizer and AdamO optimizer while reinterpreting prior methods. No equations, fitted parameters, or self-citations are shown that reduce any central claim to a tautology or post-hoc fit by construction. The identification of isometry as key is presented as an empirical and theoretical observation rather than a definitional equivalence, and the new methods are introduced as independent contributions. This satisfies the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the untested domain assumption that Jacobian singular values near one are the dominant control on plasticity.

axioms (1)
  • domain assumption Plasticity loss can be diagnosed and mitigated via the empirical Neural Tangent Kernel and layer-wise Jacobian singular values.
    Stated in the first two sentences of the abstract as the basis for the proposed mechanism.



Reference graph

Works this paper leans on

19 extracted references · 15 canonical work pages · 9 internal anchors

  1. [1]

    Layer Normalization

    Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

  2. [2]

    Norm-preserving Orthogonal Permutation Linear Unit Activation Functions (OPLU)

    Chernodub, A. and Nowicki, D. Norm-preserving orthogonal permutation linear unit activation functions (OPLU). arXiv preprint arXiv:1604.02313,

  3. [3]

    Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

    Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065,

  4. [4]

    An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

    Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211,

  5. [5]

    Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials

    Grishina, E., Smirnov, M., and Rakhuba, M. Accelerating Newton-Schulz iteration for orthogonalization via Chebyshev-type polynomials. arXiv preprint arXiv:2506.10935,

  6. [6]

    Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  7. [7]

    Maintaining Plasticity in Continual Learning via Regenerative Regularization

    Kumar, S., Marklund, H., and Van Roy, B. Maintaining plasticity in continual learning via regenerative regularization. arXiv preprint arXiv:2308.11958,

  8. [8]

    Decoupled Weight Decay Regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

  9. [9]

    Lu, L., Shin, Y., Su, Y., and Karniadakis, G. E. Dying ReLU and initialization: Theory and numerical examples. arXiv preprint arXiv:1903.06733,

  10. [10]

    Understanding and Preventing Capacity Loss in Reinforcement Learning

    Lyle, C., Rowland, M., and Dabney, W. Understanding and preventing capacity loss in reinforcement learning. arXiv preprint arXiv:2204.09560,

  11. [11]

    Normalization and Effective Learning Rates in Reinforcement Learning

    Lyle, C., Zheng, Z., Khetarpal, K., Martens, J., van Hasselt, H. P., Pascanu, R., and Dabney, W. Normalization and effective learning rates in reinforcement learning. Advances in Neural Information Processing Systems, 37: 106440–106473, 2024a. Lyle, C., Zheng, Z., Khetarpal, K., van Ha...

  12. [12]

    The Emergence of Spectral Universality in Deep Networks

    Pennington, J., Schoenholz, S., and Ganguli, S. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pp. 1924–1932. PMLR,

  13. [13]

    URL https://arxiv.org/abs/2510.01764. Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120,

  14. [14]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  15. [15]

    L2 Regularization versus Batch and Weight Normalization

    Van Laarhoven, T. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350,

  16. [16]

    MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments

    URL https://arxiv.org/abs/1903.03176. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530,

  17. [17]

    We run 8 seeds per method

    Learning rate for Adam and AdamO is 1e-3. We run 8 seeds per method. For Lipschitz-1 networks, we leave the output head unrestricted (or regularize it with a softer regularizer) so the network can approximate L-Lipschitz functions. Algorithm hyperparameters: • For AdamO, we use a regularization strength of 1e-3 for the orthogonal penalty. Learning rate is...

  18. [18]

    This provides direct empirical support for the revival mechanism discussed in Section 4.3

    The figure tracks how the number of Sokar-style dormant ReLU units evolves during training, and highlights that the isometry-preserving methods substantially reduce or reverse the buildup of inactive features relative to the baselines. This provides direct empirical support for the revival mechanism discussed in Section 4.3. B.5. MinAtar The RL diagnosti...

  19. [19]

    Octax: We consider the environments "brix", "submarine", "filter", "tank", "blinky", "missile", "ufo", "wipeoff" in this sequence for 3 cycles with 5 million training steps per env

    B.6. Octax: We consider the environments "brix", "submarine", "filter", "tank", "blinky", "missile", "ufo", "wipeoff" in this sequence for 3 cycles with 5 million training steps per env. We utilise PPO with a shared backbone following the implementation provided in (Radji et al., 2025). A precise list of our hyperparameters is given in table