pith. machine review for the scientific record. sign in

arxiv: 2605.04712 · v2 · submitted 2026-05-06 · 💻 cs.LG

Recognition: no theorem link

SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of expertscontinual reinforcement learningspectral plasticityneural tangent kernelparseval penaltyplasticity lossdeep reinforcement learning
0
0 comments X

The pith

A Parseval penalty on expert feature matrices prevents loss of spectral plasticity in mixture-of-experts policies for continual reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that mixture-of-experts networks in deep reinforcement learning lose their capacity to acquire new skills from ongoing experience streams because their spectral plasticity declines. Using neural tangent kernel theory, the authors derive a computable proxy for this plasticity that depends only on the feature matrices of the individual experts. They then introduce SPHERE as a regularization term that applies a Parseval penalty to those matrices, keeping the proxy value from dropping. If the approach holds, agents built on mixture-of-experts layers would continue to adapt to new tasks without the performance collapse that otherwise appears after extended training.

Core claim

Building on Neural Tangent Kernel theory, plasticity loss in MoE policies is formalized as a loss of spectral plasticity. A tractable proxy for this quantity is derived directly from the feature matrices of the separate experts. SPHERE is then defined as a Parseval penalty tailored to these matrices that keeps the proxy from falling. When tested on MetaWorld and HumanoidBench under continual RL, the regularized policies achieve 133 percent and 50 percent higher average success than an unregularized MoE baseline while recording higher spectral-plasticity values at every stage of training.

What carries the argument

SPHERE, the Parseval penalty applied to the feature matrices of the individual experts inside the mixture-of-experts policy; it directly regularizes the NTK-derived proxy for spectral plasticity.

If this is right

  • MoE policies retain the ability to learn diverse skills from new experience without degeneration over extended continual RL training.
  • The spectral-plasticity proxy remains higher for the entire duration of training when the Parseval penalty is applied.
  • Average task success rises by 133 percent on MetaWorld and 50 percent on HumanoidBench relative to the unregularized MoE baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same NTK-derived proxy could be used to monitor plasticity loss in mixture-of-experts models outside reinforcement learning.
  • Similar penalties might reduce the need for auxiliary techniques such as periodic network resets in long-horizon continual learning.
  • If the proxy correlates with actual adaptation speed, it could serve as an early diagnostic for when an MoE policy is about to lose plasticity.

Load-bearing premise

The tractable proxy for spectral plasticity, expressed in terms of individual expert feature matrices and derived from NTK theory, accurately reflects the true loss of plasticity in MoE policies during continual RL training.

What would settle it

An experiment in which the proxy value is tracked alongside a direct test of new-task acquisition speed after long training; if the regularized and unregularized agents show identical new-task learning curves despite large differences in the proxy, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.04712 by Cong Fang, Guoxi Zhang, Hongming Xu, Lirui Luo, Qing Li.

Figure 1
Figure 1. Figure 1: Plasticity loss as loss of spectral plasticity. Left: Continual RL can fail on later tasks despite isolated learnability. Right: By Eq. (6), ∆f = −η K ∇fL; low effective rank of the empirical Neural Tangent Kernel (eNTK) restricts updates to a few directions (collapsed spectrum), while high eNTK effective rank enables diverse directions (isotropic spectrum). the policy to improve from new experience (Kumar… view at source ↗
Figure 4
Figure 4. Figure 4: Under CRL, SPHERE improves average success by 133% over Top-K MoE and reduces the RL–CRL gap by 52%. We report average final success rates on MetaWorld across methods under RL and CRL. SPHERE outperforms other mitigation baselines. PPO PPO(10x) Top-K MoE LN SPHERE PW C-CHAIN CBP 0.00 0.25 0.50 0.75 1.00 Success Rate 0.54 0.62 0.55 0.61 0.75 0.62 0.57 0.49 0.36 0.41 0.36 0.50 0.54 0.40 0.41 0.45 RL CRL view at source ↗
Figure 5
Figure 5. Figure 5: Relative to Top-K MoE, SPHERE improves average success by 36% under RL and 50% under CRL. We report av￾erage final success rates on HumanoidBench across methods under RL and CRL. SPHERE outperforms other mitigation baselines. than MetaWorld, which makes this within-task decay more pronounced and amplifies the benefit of SPHERE under RL. Consistent with this, view at source ↗
Figure 6
Figure 6. Figure 6: SPHERE avoids spectral collapse to a single component. We visualize held-out states with t-SNE and color each state by the singular direction of the weighted expert feature matrix with the largest absolute projection. Columns show snapshots along the HumanoidBench CRL sequence. Top: w/o SPHERE. Bottom: w/ SPHERE view at source ↗
Figure 7
Figure 7. Figure 7: Expert feature isotropy tracks spectral plasticity. We visualize a scatter plot of re(A exp last) versus re(K). The Pearson correlation is r=0.846. This supports using the SPHERE penalty on A exp last as a practical proxy for spectral plasticity. Proposition 4.2, this loss of feature diversity manifests as a collapsed weighted feature Gram A exp last, which shrinks the expert-block surrogate and, via Propo… view at source ↗
Figure 8
Figure 8. Figure 8: Load balancing is positively associated with spectral plasticity. The eNTK effective rank re(K) is measured on Humanoid￾Bench for a Top-K MoE actor with and without a load-balancing objective. Load balancing maintains higher re(K), consistent with the spectral view that redistributing routing-induced trace mass across experts helps prevent collapse of functional update directions view at source ↗
Figure 9
Figure 9. Figure 9: The spectral-plasticity trend persists when re(K) is computed on online rollout states. Top-K MoE exhibits effective-rank decay, while SPHERE maintains a higher effective rank throughout training. 41 view at source ↗
Figure 10
Figure 10. Figure 10: SPHERE is robust to the ratio hyperparameter ρ on HumanoidBench under CRL. Any ρ > 0 improves five-task average success over ρ = 0 by +0.09 to +0.17, with a broad optimum around ρ = 10−3 . J.10.4. CAUSAL ANALYSIS Negative (sign-flip) intervention. To probe whether SPHERE’s effect on performance is mediated by spectral plasticity, we run a counterfactual intervention that flips the sign of the SPHERE ratio… view at source ↗
Figure 11
Figure 11. Figure 11: Under the sign-flip intervention, both re(K) and success collapse. Evaluation success rate and re(K) over training steps for ρ = −10−3 . Lead–lag evidence (negative intervention). We quantify whether changes in re(K) precede changes in success by computing the lagged correlation corr(successt, re(K)t−ℓ) over evaluation checkpoints, with ℓ measured in evaluation ticks. As shown in Figures 11 and 12, re(K) … view at source ↗
Figure 12
Figure 12. Figure 12: re(K) tends to lead success under the negative intervention. We report the correlation between successt and re(K)t−ℓ across evaluation checkpoints. Positive lag ℓ > 0 means re(K) is shifted earlier. J.10.5. ALGORITHM AGNOSTICISM Algorithm Agnosticism. To test whether SPHERE is algorithm-agnostic, we apply the same penalty to a Top-K MoE critic trained with TD3 under CRL view at source ↗
Figure 13
Figure 13. Figure 13: Gate–expert coupling is weak in a Top-K MoE actor. We visualize the Gauss–Newton block cosine similarity cos(a, b) between the gate and expert parameter groups on the HumanoidBench run task. Blocks are ordered as {gate, expert} to match Eq. (15), and block widths and heights are proportional to parameter counts. Small off-diagonal values indicate weak gate–expert coupling relative to within-block curvatur… view at source ↗
Figure 14
Figure 14. Figure 14: shows a strong positive association, supporting the Kronecker proxy as a faithful surrogate for analyzing and optimizing expert-layer spectral properties. 40 45 50 55 60 re(G GN, exp out ) 26 28 30 32 34 36 re(G GN, exp out ) Pearson r=0.861 view at source ↗
Figure 15
Figure 15. Figure 15: K-FAC proxy curvatures closely match empirical Gauss–Newton curvatures at the expert output block. The K-FAC proxy curvature qKFAC concentrates near the diagonal when compared to the empirical Gauss–Newton curvature qGN. Pearson correlation between log qKFAC and log qGN is r=0.980. J.14. Applicability Beyond RL and MLP Settings J.14.1. VISUAL METAWORLD WITH A RESNET BACKBONE We further evaluate SPHERE in … view at source ↗
read the original abstract

In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Mixture-of-Experts policies in continual deep RL suffer from loss of spectral plasticity, which can be formalized via NTK theory as a tractable proxy expressible in terms of individual expert feature matrices; SPHERE, a Parseval penalty based on this proxy, is introduced to mitigate the issue and yields 133% and 50% gains in average success rate over unregularized MoE baselines on MetaWorld and HumanoidBench while preserving higher spectral plasticity throughout training.

Significance. If the NTK-derived proxy is shown to accurately track true plasticity loss rather than serving as generic regularization, the work offers a principled, scalable approach to maintaining learning capacity in MoE architectures for non-stationary RL; the reported gains on two standard continual-RL benchmarks constitute a concrete empirical contribution, and the explicit grounding in NTK theory is a strength that could enable further theoretical analysis.

major comments (3)
  1. [§3.2] §3.2 (derivation of the tractable proxy): the proxy is obtained by linearizing the MoE policy under NTK assumptions (infinite width, fixed data distribution at initialization); continual RL violates these via finite-width experts, non-stationary task streams, and policy updates far from initialization, so the manuscript must demonstrate (via correlation plots or ablation) that the proxy remains predictive of actual degradation in new-task performance rather than merely acting as a tunable regularizer.
  2. [§4.3] §4.3 and Table 2: the 133% and 50% average-success improvements are reported without error bars, number of seeds, or statistical tests; because the central claim is that SPHERE specifically mitigates spectral-plasticity loss (rather than generic regularization), these omissions make it impossible to judge whether the gains are robust or reproducible.
  3. [§3.1] §3.1 (formalization of spectral plasticity): the loss is defined via the smallest eigenvalue of the NTK Gram matrix restricted to expert features; the paper should clarify whether this quantity is computed exactly or approximated, and whether the approximation remains valid once experts are updated during continual training.
minor comments (2)
  1. [Abstract] The abstract states the performance gains but omits any mention of variance, number of runs, or hyper-parameter sensitivity; adding these details would strengthen the empirical claims.
  2. [§3.3] Notation for the Parseval penalty (Eq. (X)) should explicitly state how the coefficient is chosen or tuned; the current description leaves open whether it is a fixed hyper-parameter or derived from the proxy.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and commit to revising the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (derivation of the tractable proxy): the proxy is obtained by linearizing the MoE policy under NTK assumptions (infinite width, fixed data distribution at initialization); continual RL violates these via finite-width experts, non-stationary task streams, and policy updates far from initialization, so the manuscript must demonstrate (via correlation plots or ablation) that the proxy remains predictive of actual degradation in new-task performance rather than merely acting as a tunable regularizer.

    Authors: We acknowledge that the NTK assumptions are idealized and do not hold exactly under continual RL. In the revised manuscript we will add correlation plots relating the proxy values to measured new-task performance degradation across training checkpoints. We will also include ablations comparing SPHERE against alternative regularizers to isolate its effect on spectral plasticity. revision: yes

  2. Referee: [§4.3] §4.3 and Table 2: the 133% and 50% average-success improvements are reported without error bars, number of seeds, or statistical tests; because the central claim is that SPHERE specifically mitigates spectral-plasticity loss (rather than generic regularization), these omissions make it impossible to judge whether the gains are robust or reproducible.

    Authors: We agree that these statistical details are necessary. We will revise Table 2 and the experimental section to report mean ± standard deviation over 5 random seeds, state the seed count explicitly, and add statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank) between SPHERE and the unregularized baseline. revision: yes

  3. Referee: [§3.1] §3.1 (formalization of spectral plasticity): the loss is defined via the smallest eigenvalue of the NTK Gram matrix restricted to expert features; the paper should clarify whether this quantity is computed exactly or approximated, and whether the approximation remains valid once experts are updated during continual training.

    Authors: The smallest eigenvalue is computed exactly from the Gram matrix of the current expert feature matrices at each evaluation checkpoint. We will add a clarifying paragraph in §3.1 describing this exact computation and discuss its continued empirical validity during training, consistent with the spectral-plasticity tracking already shown throughout the experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation grounded in external NTK theory with empirical validation

full rationale

The paper formalizes plasticity loss via NTK theory (external), derives a tractable proxy expressible in expert feature matrices, and introduces SPHERE as a Parseval penalty based on that proxy. Performance improvements (133%/50%) are shown via experiments on MetaWorld and HumanoidBench rather than by construction. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain; the central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Approach rests on NTK theory applied to MoE policies and a derived proxy; no new physical entities or heavily fitted constants are introduced in the abstract.

free parameters (1)
  • penalty strength coefficient
    Regularization hyperparameter whose value is chosen to balance plasticity preservation and task performance; not specified in abstract.
axioms (1)
  • domain assumption Neural Tangent Kernel theory provides a valid linearization for analyzing plasticity in trained MoE policies under continual RL updates.
    Invoked to formalize spectral plasticity loss and derive the tractable proxy.

pith-pipeline@v0.9.0 · 5505 in / 1038 out tokens · 36551 ms · 2026-05-11T01:46:09.587964+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 2 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems , year=

    Continual world: A robotic benchmark for continual reinforcement learning , author=. Advances in Neural Information Processing Systems , year=

  2. [2]

    Advances in Neural Information Processing Systems , volume=

    Wide neural networks of any depth evolve as linear models under gradient descent , author=. Advances in Neural Information Processing Systems , volume=

  3. [3]

    Proceedings of the 34th International Conference on Machine Learning , year=

    Practical Gauss-Newton Optimisation for Deep Learning , author=. Proceedings of the 34th International Conference on Machine Learning , year=

  4. [5]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  5. [6]

    The Twelfth International Conference on Learning Representations , year=

    Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages , author=. The Twelfth International Conference on Learning Representations , year=

  6. [7]

    Proceedings of the Conference on Robot Learning , year=

    Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning , author=. Proceedings of the Conference on Robot Learning , year=

  7. [8]

    International Conference on Machine Learning , year=

    Controlling overestimation bias with truncated mixture of continuous distributional quantile critics , author=. International Conference on Machine Learning , year=

  8. [9]

    A Bradford Book , year=

    Reinforcement learning: An introduction , author=. A Bradford Book , year=

  9. [10]

    International Journal of Information and Systems Sciences , year=

    Hadamard, Khatri-Rao, Kronecker and other matrix products , author=. International Journal of Information and Systems Sciences , year=

  10. [11]

    Plasticity Loss in Deep Reinforcement Learning: A Survey

    Plasticity Loss in Deep Reinforcement Learning: A Survey , author=. arXiv preprint arXiv:2411.04832 , year=

  11. [12]

    2007 15th European signal processing conference , year=

    The effective rank: A measure of effective dimensionality , author=. 2007 15th European signal processing conference , year=

  12. [13]

    International Conference on Machine Learning , year=

    The dormant neuron phenomenon in deep reinforcement learning , author=. International Conference on Machine Learning , year=

  13. [14]

    Advances in Neural Information Processing Systems , year=

    Deep reinforcement learning with plasticity injection , author=. Advances in Neural Information Processing Systems , year=

  14. [15]

    Conference on lifelong learning agents , year=

    Loss of plasticity in continual deep reinforcement learning , author=. Conference on lifelong learning agents , year=

  15. [16]

    Trends in neurosciences , year=

    Memory retention--the synaptic stability versus plasticity dilemma , author=. Trends in neurosciences , year=

  16. [17]

    Nature , year=

    Loss of plasticity in deep continual learning , author=. Nature , year=

  17. [18]

    International Conference on Learning Representations , year=

    Implicit under-parameterization inhibits data-efficient deep reinforcement learning , author=. International Conference on Learning Representations , year=

  18. [19]

    Brain mechanisms in conditioning and learning , author=

  19. [20]

    International Conference on Learning Representations , year=

    Understanding and Preventing Capacity Loss in Reinforcement Learning , author=. International Conference on Learning Representations , year=

  20. [21]

    International Conference on Machine Learning , year=

    Understanding plasticity in neural networks , author=. International Conference on Machine Learning , year=

  21. [22]

    Frontiers in Cellular Neuroscience , year=

    The impact of studying brain plasticity , author=. Frontiers in Cellular Neuroscience , year=

  22. [23]

    International conference on machine learning , year=

    The primacy bias in deep reinforcement learning , author=. International conference on machine learning , year=

  23. [24]

    International Conference on Machine Learning , year=

    Mixtures of Experts unlock parameter scaling for deep RL , author=. International Conference on Machine Learning , year=

  24. [25]

    Reinforcement Learning Journal , year=

    Mixture of Experts in a Mixture of RL settings , author=. Reinforcement Learning Journal , year=

  25. [26]

    International Conference on Machine Learning , year=

    Transient non-stationarity and generalisation in deep reinforcement learning , author=. International Conference on Machine Learning , year=

  26. [27]

    International Conference on Learning Representations , year=

    Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL , author=. International Conference on Learning Representations , year=

  27. [28]

    IEEE/RSJ International Conference on Intelligent Robots and Systems , year=

    MoE-Loco: Mixture of Experts for Multitask Locomotion , author=. IEEE/RSJ International Conference on Intelligent Robots and Systems , year=

  28. [30]

    International Conference on Machine Learning , year=

    Mentor: Mixture-of-experts network with task-oriented perturbation for visual reinforcement learning , author=. International Conference on Machine Learning , year=

  29. [31]

    International Conference on Learning Representations , year=

    DrM: Mastering visual reinforcement learning through dormant ratio minimization , author=. International Conference on Learning Representations , year=

  30. [32]

    International Conference on Learning Representations , year=

    Neuroplastic expansion in deep reinforcement learning , author=. International Conference on Learning Representations , year=

  31. [33]

    The Thirteenth International Conference on Learning Representations , year=

    Prevalence of Negative Transfer in Continual Reinforcement Learning: Analyses and a Simple Baseline , author=. The Thirteenth International Conference on Learning Representations , year=

  32. [34]

    Aneesh Muppidi and Zhiyu Zhang and Heng Yang , booktitle=. Fast

  33. [35]

    Proceedings of the 42nd International Conference on Machine Learning , year =

    Knowledge Retention in Continual Model-Based Reinforcement Learning , author =. Proceedings of the 42nd International Conference on Machine Learning , year =

  34. [36]

    The Thirteenth International Conference on Learning Representations , year=

    Theory on Mixture-of-Experts in Continual Learning , author=. The Thirteenth International Conference on Learning Representations , year=

  35. [37]

    Conference on Lifelong Learning Agents , pages=

    Measuring and mitigating interference in reinforcement learning , author=. Conference on Lifelong Learning Agents , pages=. 2023 , organization=

  36. [38]

    International conference on machine learning , year=

    Addressing function approximation error in actor-critic methods , author=. International conference on machine learning , year=

  37. [39]

    2012 , publisher=

    Matrix analysis , author=. 2012 , publisher=

  38. [40]

    2013 , publisher=

    Matrix computations , author=. 2013 , publisher=

  39. [41]

    1979 , publisher=

    Inequalities: theory of majorization and its applications , author=. 1979 , publisher=

  40. [42]

    2006 , publisher=

    Pattern recognition and machine learning , author=. 2006 , publisher=

  41. [43]

    2016 , publisher=

    Deep Learning , author=. 2016 , publisher=

  42. [44]

    International Conference on Learning Representations , year=

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , author=. International Conference on Learning Representations , year=

  43. [45]

    Journal of Machine Learning Research , year=

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , author=. Journal of Machine Learning Research , year=

  44. [46]

    Advances in Neural Information Processing Systems , year=

    Mixture-of-Experts with Expert Choice Routing , author=. Advances in Neural Information Processing Systems , year=

  45. [47]

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Auxiliary-loss-free load balancing strategy for mixture-of-experts , author=. arXiv preprint arXiv:2408.15664 , year=

  46. [48]

    Advances in Neural Information Processing Systems , year=

    Parseval Regularization for Continual Reinforcement Learning , author=. Advances in Neural Information Processing Systems , year=

  47. [50]

    International Conference on Machine Learning , year=

    Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn , author=. International Conference on Machine Learning , year=

  48. [51]

    Conference on Lifelong Learning Agents , year=

    Disentangling the Causes of Plasticity Loss in Neural Networks , author=. Conference on Lifelong Learning Agents , year=

  49. [52]

    International Conference on Learning Representations , year =

    Spectral Normalization for Generative Adversarial Networks , author =. International Conference on Learning Representations , year =

  50. [53]

    Proceedings of the 7th International Conference on Learning Representations , year =

    The Singular Values of Convolutional Layers , author =. Proceedings of the 7th International Conference on Learning Representations , year =

  51. [54]

    International conference on machine learning , year=

    Optimizing neural networks with kronecker-factored approximate curvature , author=. International conference on machine learning , year=

  52. [55]

    International Conference on Machine Learning , year=

    A kronecker-factored approximate fisher matrix for convolution layers , author=. International Conference on Machine Learning , year=

  53. [56]

    Advances in Neural Information Processing Systems , year =

    Theoretical Characterisation of the Gauss Newton Conditioning in Neural Networks , author=. Advances in Neural Information Processing Systems , year =

  54. [57]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =

  55. [58]

    Technical University of Denmark , year=

    The matrix cookbook , author=. Technical University of Denmark , year=

  56. [59]

    Transactions on Machine Learning Research , year=

    The Low-Rank Simplicity Bias in Deep Networks , author=. Transactions on Machine Learning Research , year=

  57. [61]

    International Conference on Learning Representations , year=

    Learning Continually by Spectral Regularization , author=. International Conference on Learning Representations , year=

  58. [62]

    Advances in Neural Information Processing Systems , year =

    Towards Deeper Deep Reinforcement Learning with Spectral Normalization , author =. Advances in Neural Information Processing Systems , year =

  59. [63]

    Abbas, Z., Zhao, R., Modayil, J., White, A., and Machado, M. C. Loss of plasticity in continual deep reinforcement learning. In Conference on lifelong learning agents, 2023

  60. [64]

    P., and Weinberger, K

    Bjorck, N., Gomes, C. P., and Weinberger, K. Q. Towards deeper deep reinforcement learning with spectral normalization. In Advances in Neural Information Processing Systems, 2021

  61. [65]

    Practical gauss-newton optimisation for deep learning

    Botev, A., Ritter, H., and Barber, D. Practical gauss-newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning, 2017

  62. [66]

    Parseval regularization for continual reinforcement learning

    Chung, W., Cherif, L., Meger, D., and Precup, D. Parseval regularization for continual reinforcement learning. In Advances in Neural Information Processing Systems, 2024

  63. [67]

    Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models

    Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  64. [68]

    F., Lan, Q., Rahman, P., Mahmood, A

    Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 2024

  65. [69]

    Golub, G. H. and Van Loan, C. F. Matrix computations. JHU press, 2013

  66. [70]

    and Martens, J

    Grosse, R. and Martens, J. A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning, 2016

  67. [71]

    Spectral collapse drives loss of plasticity in deep continual learning

    He, N., Guo, K., Prakash, A., Tiwari, S., Tao, R. Y., Serapio, T., Greenwald, A., and Konidaris, G. Spectral collapse drives loss of plasticity in deep continual learning. arXiv preprint arXiv:2509.22335, 2025

  68. [72]

    Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge University Press, 2012

  69. [73]

    Huang, R. et al. Moe-loco: Mixture of experts for multitask locomotion. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025 a

  70. [74]

    Huang, S. et al. Mentor: Mixture-of-experts network with task-oriented perturbation for visual reinforcement learning. In International Conference on Machine Learning, 2025 b

  71. [75]

    The low-rank simplicity bias in deep networks

    Huh, M., Mobahi, H., Zhang, R., Cheung, B., Agrawal, P., and Isola, P. The low-rank simplicity bias in deep networks. Transactions on Machine Learning Research, 2023

  72. [76]

    Igl, M. et al. Transient non-stationarity and generalisation in deep reinforcement learning. In International Conference on Machine Learning, 2021

  73. [77]

    and Ash, J

    Juliani, A. and Ash, J. T. A study of plasticity loss in on-policy deep reinforcement learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  74. [78]

    Implicit under-parameterization inhibits data-efficient deep reinforcement learning

    Kumar, A., Agarwal, R., Ghosh, D., and Levine, S. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In International Conference on Learning Representations, 2021

  75. [79]

    S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J

    Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems, 32, 2019

  76. [80]

    Lewandowski, A., Tanaka, H., Schuurmans, D., and Machado, M. C. Directions of curvature as an explanation for loss of plasticity. arXiv preprint arXiv:2312.00246, 2023

  77. [81]

    Lewandowski, A., Bortkiewicz, M., Kumar, S., Gy \"o rgy, A., Schuurmans, D., Ostaszewski, M., and Machado, M. C. Learning continually by spectral regularization. In International Conference on Learning Representations, 2025

  78. [82]

    Hadamard, khatri-rao, kronecker and other matrix products

    Liu, S., Trenkler, G., et al. Hadamard, khatri-rao, kronecker and other matrix products. International Journal of Information and Systems Sciences, 2008

  79. [83]

    Livingston, R. B. Brain mechanisms in conditioning and learning. Technical report, Office of Naval Research, 1966

  80. [84]

    Understanding and preventing capacity loss in reinforcement learning

    Lyle, C., Rowland, M., and Dabney, W. Understanding and preventing capacity loss in reinforcement learning. In International Conference on Learning Representations, 2022

Showing first 80 references.