Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective

Ali Payani; Claire Chen; Jayanth Srinivasa; Jiuqi Wang; Shangtong Zhang; Shuze Daniel Liu

arxiv: 2605.09044 · v1 · submitted 2026-05-09 · 💻 cs.LG

Jiuqi Wang, Jayanth Srinivasa, Claire Chen, Shuze Daniel Liu, Ali Payani, Shangtong Zhang

Pith reviewed 2026-05-12 02:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords continual learning · plasticity · optimization readiness · gradient diagnostics · trainability prediction · deep neural networks · loss of plasticity

The pith

Optimization readiness lower bounds one-step optimization gain and predicts trainability in continual learning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether diagnostics can predict loss of plasticity in deep neural networks during continual learning. It constructs counterexamples showing that representation rank and neural tangent kernel rank fail to predict trainability loss in regression and classification. A new metric, optimization readiness, is proposed by combining gradient strength and gradient reliability. Under standard smoothness assumptions, this metric is proven to lower bound the one-step optimization gain, providing a theoretical guarantee. Empirically, it ranks checkpoints more reliably than previous methods across benchmarks like Slowly-Changing Regression and Permuted MNIST, using fewer samples.

Core claim

Optimization readiness combines gradient strength and gradient reliability. Counterexamples demonstrate that representation rank and neural tangent kernel rank can fail to predict loss of trainability in both regression and classification. The central result proves that optimization readiness lower bounds one-step optimization gain under standard smoothness assumptions, offering a theoretical guarantee for predicting future trainability on target tasks.

What carries the argument

Optimization readiness, a metric integrating gradient strength with gradient reliability to lower-bound future one-step optimization gain.
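
As a rough illustration of how a metric of this shape could be computed, the sketch below starts from per-sample gradients, measures gradient strength as the squared norm of the mean gradient, and measures gradient reliability as the fraction of per-sample gradient energy that survives averaging. Both definitions, the choice to multiply them, and every function name are assumptions made for illustration; this page does not reproduce the paper's exact formulas.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_norm(v: Vector) -> float:
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def mean_gradient(grads: List[Vector]) -> List[float]:
    """Coordinate-wise average of per-sample gradients."""
    n = len(grads)
    return [sum(g[j] for g in grads) / n for j in range(len(grads[0]))]


def gradient_strength(grads: List[Vector]) -> float:
    """ASSUMPTION: strength = squared norm of the mean gradient."""
    return sq_norm(mean_gradient(grads))


def gradient_reliability(grads: List[Vector]) -> float:
    """ASSUMPTION: reliability = ||mean g||^2 / mean ||g_i||^2, a value in [0, 1]."""
    mean_energy = sum(sq_norm(g) for g in grads) / len(grads)
    if mean_energy == 0.0:
        return 0.0  # all gradients vanish, so there is no signal to rely on
    return gradient_strength(grads) / mean_energy


def readiness(grads: List[Vector]) -> float:
    """ASSUMPTION: combine the two factors by multiplication."""
    return gradient_strength(grads) * gradient_reliability(grads)


aligned = [[1.0, 0.0], [1.0, 0.0]]       # samples agree on the direction
conflicting = [[1.0, 0.0], [-1.0, 0.0]]  # samples cancel each other out
print(readiness(aligned), readiness(conflicting))  # prints "1.0 0.0"
```

In this toy reading, a checkpoint whose per-sample gradients agree scores high, while one whose gradients cancel scores zero even though each individual gradient is large. That is the intuition behind pairing strength with reliability, whatever the paper's precise definitions turn out to be.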

Load-bearing premise

The loss function satisfies standard smoothness assumptions required for the lower-bound proof.
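
For orientation, the textbook argument behind bounds of this kind is the descent lemma. The derivation below is the standard one for a full-batch gradient step with step size eta on an L-smooth loss f. It is background only, not the paper's theorem, whose statement in terms of optimization readiness is not reproduced on this page. The symbols f, theta, eta, and L are notation introduced here.

```latex
\begin{align*}
f(\theta') &\le f(\theta) + \langle \nabla f(\theta),\, \theta' - \theta \rangle
  + \frac{L}{2}\,\|\theta' - \theta\|^2
  && \text{($L$-smoothness)} \\
f\bigl(\theta - \eta \nabla f(\theta)\bigr) &\le f(\theta)
  - \eta\Bigl(1 - \frac{L\eta}{2}\Bigr)\|\nabla f(\theta)\|^2
  && \text{(set } \theta' = \theta - \eta \nabla f(\theta)\text{)} \\
f(\theta) - f\bigl(\theta - \eta \nabla f(\theta)\bigr) &\ge
  \eta\Bigl(1 - \frac{L\eta}{2}\Bigr)\|\nabla f(\theta)\|^2
  && \text{(one-step gain)}
\end{align*}
```

With eta = 1/L the right-hand side becomes the squared gradient norm divided by 2L, so a large gradient guarantees a large one-step gain. A small gradient guarantees nothing in either direction, which is the one-sidedness the referee report dwells on.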

What would settle it

A smooth loss and network where optimization readiness is high but measured one-step optimization gain falls below the proven lower bound, or where the metric fails to rank checkpoints by trainability better than rank diagnostics in a new continual learning setting.
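
A falsification check of this kind can be sketched on a toy problem. The example below uses a diagonal quadratic loss, whose smoothness constant is its largest curvature, and compares the measured one-step gain with the textbook descent-lemma bound lr * (1 - L * lr / 2) * ||grad||^2. That bound is a stand-in assumed here for illustration; the paper's readiness-based bound is not reproduced on this page, and all function names are hypothetical.

```python
from typing import List


def quadratic_loss(theta: List[float], curvatures: List[float]) -> float:
    """f(theta) = 0.5 * sum_i a_i * theta_i^2."""
    return 0.5 * sum(a * t * t for a, t in zip(curvatures, theta))


def quadratic_grad(theta: List[float], curvatures: List[float]) -> List[float]:
    """Gradient of the diagonal quadratic: a_i * theta_i."""
    return [a * t for a, t in zip(curvatures, theta)]


def one_step_gain(theta: List[float], curvatures: List[float], lr: float) -> float:
    """Loss decrease after a single full-batch gradient step."""
    grad = quadratic_grad(theta, curvatures)
    new_theta = [t - lr * g for t, g in zip(theta, grad)]
    return quadratic_loss(theta, curvatures) - quadratic_loss(new_theta, curvatures)


def descent_lemma_bound(theta: List[float], curvatures: List[float], lr: float) -> float:
    """Stand-in lower bound: lr * (1 - L * lr / 2) * ||grad||^2 with L = max curvature."""
    smoothness = max(curvatures)
    grad_sq = sum(g * g for g in quadratic_grad(theta, curvatures))
    return lr * (1.0 - 0.5 * smoothness * lr) * grad_sq


def bound_is_violated(theta: List[float], curvatures: List[float], lr: float,
                      tol: float = 1e-12) -> bool:
    """True only if the measured gain falls below the bound (beyond float tolerance)."""
    return one_step_gain(theta, curvatures, lr) < descent_lemma_bound(theta, curvatures, lr) - tol


theta0 = [1.0, 2.0]
curvs = [1.0, 4.0]   # L = 4, so lr = 0.25 equals 1 / L
print(one_step_gain(theta0, curvs, 0.25), descent_lemma_bound(theta0, curvs, 0.25))
```

On this example the measured gain (8.21875) sits above the bound (8.125), as smoothness requires. A counterexample to a bound of this type would be any smooth setting where `bound_is_violated` returns `True`.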

Figures

Figures reproduced from arXiv: 2605.09044 by Ali Payani, Claire Chen, Jayanth Srinivasa, Jiuqi Wang, Shangtong Zhang, Shuze Daniel Liu.

**Figure 2.** 1-, 10-, and 100-step gain vs. plasticity metric values for Slowly-Changing Regression and [...]

**Figure 3.** 1-, 10-, and 100-step gain and plasticity metric values against checkpoints for Slowly [...]

**Figure 4.** Subsampling ablation study results for Slowly-Changing Regression and Permuted MNIST.

Original abstract

Deep continual learning requires models to adapt to new tasks without retraining from scratch. However, neural networks can lose their ability to adapt to new tasks after training on previous ones, a phenomenon known as loss of plasticity. There have been several explanations and diagnostics proposed for plasticity loss. Motivated by the philosophy that "all models are wrong, but some are useful", we ask: can existing diagnostics predict a neural network's plasticity? In this work, we take a practical view to interpret plasticity as trainability, i.e., a neural network's future optimization gain on a target task. We first take a theoretical approach, showing, by constructing a few counterexamples, that some widely adopted diagnostics of plasticity, including representation rank and neural tangent kernel rank, can fail to predict the loss of trainability in both regression and classification settings. We instead propose a novel metric, called optimization readiness, which combines gradient strength and gradient reliability. We prove that optimization readiness lower bounds one-step optimization gain under standard smoothness assumptions, providing a theoretical guarantee for its predictive power. Empirically, we show that across commonly used deep continual learning settings, such as Slowly-Changing Regression and Permuted MNIST, optimization readiness more reliably ranks checkpoints by trainability than prior diagnostics, even with substantially fewer samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper gives a new gradient-based metric for predicting trainability in continual learning with a one-sided lower-bound proof, plus counterexamples against rank diagnostics, but the theory does not fully support reliable ranking of all checkpoints.

This paper's main point is that rank-based diagnostics for loss of plasticity can fail. The authors offer optimization readiness, a combination of gradient strength and gradient reliability, as a replacement, with a proof that it lower-bounds one-step optimization gain under L-smoothness. They back this with counterexamples in regression and classification, then show better ranking on Slowly-Changing Regression and Permuted MNIST using fewer samples than prior methods. That combination of targeted counterexamples and a concrete bound is the clearest advance here.

The work is useful because it moves beyond pure heuristics toward something with a stated guarantee, and the fewer-sample claim is a practical advantage for real continual learning pipelines. The bound itself is standard and does not rely on fitted parameters, which keeps the circularity low.

The soft spot is exactly the one-way nature of the result: high readiness forces positive gain, but low readiness does not rule out large gain. This leaves open false negatives in the ranking, so the empirical claim that the metric “more reliably ranks checkpoints by trainability” rests more on the two benchmarks than on the theory. Those benchmarks are standard but narrow; broader or more adversarial settings would make the comparison stronger.

The paper is aimed at continual learning researchers who need diagnostics for checkpoint selection or model design. Readers who care about theoretical grounding for trainability will get value from the proof and counterexamples. It has enough new content and evidence to deserve a serious referee rather than a desk reject, though the review should press on the scope of the guarantee and ask for more diverse validation. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that common diagnostics for loss of plasticity (representation rank, NTK rank) fail to predict trainability, as shown via counterexamples in regression and classification. It introduces optimization readiness (gradient strength combined with gradient reliability), proves this metric lower-bounds one-step optimization gain under standard L-smoothness assumptions, and reports that it ranks checkpoints by future trainability more reliably than prior diagnostics on Slowly-Changing Regression and Permuted MNIST, even with fewer samples.

Significance. If the central results hold, the work is significant: the counterexamples demonstrate concrete failures of prior metrics, the lower-bound proof supplies a formal guarantee under standard assumptions, and the empirical ranking results indicate practical utility with sample efficiency. This could improve checkpoint selection and monitoring in continual learning pipelines.

major comments (2)

[Theoretical analysis of optimization readiness] The lower-bound result shows only that sufficiently high optimization readiness forces non-trivial one-step gain. This one-directional implication does not rule out high actual gain for low-readiness checkpoints, which weakens support for the claim that the metric 'more reliably ranks checkpoints by trainability' across all cases (see the proof establishing the lower bound and the empirical ranking claims).
[Empirical evaluation on benchmarks] The empirical superiority is demonstrated on Slowly-Changing Regression and Permuted MNIST. The manuscript should verify whether the ranking advantage persists when the number of optimization steps or the degree of task similarity is varied, as these factors directly affect whether the one-step gain bound translates to reliable multi-checkpoint ordering.
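
To make "ranks checkpoints by trainability" concrete, one common way to score agreement between a diagnostic and the measured gain is a rank correlation such as Kendall's tau. The sketch below implements the simple tau-a variant in plain Python. Whether the paper scores rankings with this exact statistic is an assumption, and the checkpoint numbers are made up for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(metric_values: Sequence[float], gains: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_values) != len(gains):
        raise ValueError("metric_values and gains must have the same length")
    n = len(metric_values)
    if n < 2:
        raise ValueError("need at least two checkpoints to compare rankings")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether the pair is ordered the same way
        # by the diagnostic and by the measured gain.
        product = (metric_values[i] - metric_values[j]) * (gains[i] - gains[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    return (concordant - discordant) / (n * (n - 1) // 2)


# Hypothetical diagnostic values and measured gains for four checkpoints.
diagnostic = [0.9, 0.6, 0.4, 0.1]
measured_gain = [0.50, 0.20, 0.30, 0.05]
print(kendall_tau(diagnostic, measured_gain))
```

Here one of the six checkpoint pairs is ordered differently by the diagnostic and the gain, so tau is (5 - 1) / 6. A value of 1 means the diagnostic orders checkpoints exactly as their measured gain does, and a value near 0 means the ordering carries little information.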

minor comments (2)

[Definition of optimization readiness] The precise definitions of gradient strength and gradient reliability should be stated explicitly in the main text (with equation numbers) rather than deferred entirely to the appendix, to aid readability of the metric.
[Assumptions in the proof] A brief discussion of how the L-smoothness assumption relates to the non-convex losses typical in deep networks would help readers assess the practical scope of the theoretical guarantee.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point by point below, providing the strongest honest defense while acknowledging where revisions are warranted.

Referee: [Theoretical analysis of optimization readiness] The lower-bound result shows only that sufficiently high optimization readiness forces non-trivial one-step gain. This one-directional implication does not rule out high actual gain for low-readiness checkpoints, which weakens support for the claim that the metric 'more reliably ranks checkpoints by trainability' across all cases (see the proof establishing the lower bound and the empirical ranking claims).

Authors: We agree that the lower bound is strictly one-directional: high optimization readiness guarantees non-trivial one-step gain under L-smoothness, but low readiness does not necessarily imply poor gain. This nuance means the metric provides a sufficient (not necessary) condition for trainability. Our primary support for superior ranking reliability comes from the empirical results on Slowly-Changing Regression and Permuted MNIST, where optimization readiness outperformed rank-based diagnostics in ordering checkpoints by observed trainability, even with fewer samples. The bound supplies a formal guarantee in the regime where plasticity is preserved, complementing the counterexamples for prior metrics. We will revise the manuscript to explicitly note the one-directional character of the bound, clarify that ranking superiority is an empirical finding, and avoid any implication of necessity.

revision: partial

Referee: [Empirical evaluation on benchmarks] The empirical superiority is demonstrated on Slowly-Changing Regression and Permuted MNIST. The manuscript should verify whether the ranking advantage persists when the number of optimization steps or the degree of task similarity is varied, as these factors directly affect whether the one-step gain bound translates to reliable multi-checkpoint ordering.

Authors: We concur that additional verification would strengthen the empirical claims. The reported results use the standard benchmark configurations, but we will incorporate new experiments that systematically vary the number of optimization steps (e.g., 1, 5, and 10 steps) and task similarity (by modulating the change rate in Slowly-Changing Regression and permutation intensity in Permuted MNIST). These extensions will directly test whether the ranking advantage and the utility of the one-step bound persist under conditions closer to multi-step continual learning.

revision: yes
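
The step-count sweep described in this response could be measured along the following lines. The sketch records the loss decrease from a single starting checkpoint after 1, 5, and 10 gradient steps. Plain full-batch gradient descent, the toy quadratic, and the function name `multi_step_gains` are assumptions for illustration and do not describe the authors' experimental setup.

```python
from typing import Callable, Dict, Iterable, List

LossFn = Callable[[List[float]], float]
GradFn = Callable[[List[float]], List[float]]


def multi_step_gains(loss_fn: LossFn, grad_fn: GradFn, theta: List[float],
                     lr: float, horizons: Iterable[int]) -> Dict[int, float]:
    """Return {k: loss(theta_0) - loss(theta_k)} for each requested step count k."""
    initial_loss = loss_fn(theta)
    current = list(theta)  # copy so the caller's checkpoint is left untouched
    gains: Dict[int, float] = {}
    step = 0
    for horizon in sorted(set(horizons)):
        # Continue the same trajectory up to the next horizon.
        while step < horizon:
            grad = grad_fn(current)
            current = [t - lr * g for t, g in zip(current, grad)]
            step += 1
        gains[horizon] = initial_loss - loss_fn(current)
    return gains


def toy_loss(theta: List[float]) -> float:
    """Toy target task: f(theta) = 0.5 * ||theta||^2."""
    return 0.5 * sum(t * t for t in theta)


def toy_grad(theta: List[float]) -> List[float]:
    return list(theta)


gains = multi_step_gains(toy_loss, toy_grad, [2.0], lr=0.5, horizons=[1, 5, 10])
print(gains)
```

Running the same function on every checkpoint gives one gain per horizon, and each horizon's gains can then be compared against the diagnostic's ordering to see whether a one-step guarantee still tracks longer training.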

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper defines optimization readiness directly from observable quantities (gradient strength and reliability) and derives its lower bound on one-step gain from standard L-smoothness assumptions on the loss function. This is a conventional theoretical implication rather than a reduction to fitted parameters, self-referential definitions, or load-bearing self-citations. Counterexamples for prior diagnostics are constructed independently, and empirical ranking comparisons on Slowly-Changing Regression and Permuted MNIST are presented as separate validation. No equations or claims reduce the proposed metric or its guarantee to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the definition of optimization readiness from gradient statistics and the invocation of standard smoothness assumptions to obtain the one-step gain lower bound. No free parameters or invented entities are introduced.

axioms (1)

Domain assumption: standard smoothness assumptions on the loss function.
Invoked to prove that optimization readiness lower-bounds one-step optimization gain.

pith-pipeline@v0.9.0 · 5541 in / 1202 out tokens · 44518 ms · 2026-05-12T02:31:07.689158+00:00 · methodology
