Stagnant Neuron: Towards Understanding the Plasticity Loss in Multi-Agent Reinforcement Learning Value Factorization Methods

Cheng Wang; Chennan Ma; Haipeng Zhang; Haoyuan Qin; Jiawei Hu; Junhao Wu; Miao Zhu; Siqi Shen; Zeming Gao; Zhengzhu Liu

arxiv: 2606.25335 · v1 · pith:N7FD45PWnew · submitted 2026-06-24 · 💻 cs.LG

Stagnant Neuron: Towards Understanding the Plasticity Loss in Multi-Agent Reinforcement Learning Value Factorization Methods

Zhengzhu Liu , Zeming Gao , Haoyuan Qin , Jiawei Hu , Junhao Wu , Miao Zhu , Haipeng Zhang , Chennan Ma

show 2 more authors

Siqi Shen Cheng Wang

This is my paper

Pith reviewed 2026-06-25 21:20 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-agent reinforcement learningvalue factorizationplasticity lossstagnant neuronsKNIFEtransfer learningSMACv2

0 comments

The pith

KNIFE restores learning capacity in multi-agent RL value factorization by directly replacing stagnant neurons whose gradients have shrunk to near zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies stagnant neurons—units whose weight updates become negligibly small—as the source of plasticity loss when multi-agent value factorization methods transfer to new task instances. Existing plasticity injection techniques fail against these neurons because they do not isolate the units whose gradients have collapsed. KNIFE addresses the problem by swapping each stagnant neuron for a three-part composite: a frozen copy that keeps previously learned cooperation knowledge, a freshly initialized neuron that regains the ability to learn, and a compensation neuron that keeps the overall output unchanged so the factorization structure is preserved. Experiments on SMACv2, predator-prey, and matrix games show the method outperforms prior plasticity-injection baselines. The central premise is that preserving acquired knowledge while resetting only the inert units is sufficient to recover adaptability without retraining from scratch.

Core claim

Stagnant neurons cause plasticity loss in MARL value factorization; KNIFE replaces each such neuron with a composite unit consisting of a frozen knowledge neuron, a re-initialized active neuron, and a compensation neuron whose combined output matches the original, thereby restoring learning capacity while preserving cooperation knowledge.

What carries the argument

KNIFE composite replacement: a three-neuron module (frozen knowledge neuron + re-initialized active neuron + compensation neuron) that substitutes for each identified stagnant neuron while keeping the value factorization output unchanged.

If this is right

Value-factorization networks can continue adapting after task transfer without full retraining.
Gradient-collapse detection becomes a practical diagnostic for when plasticity has been lost.
The same neuron-level replacement logic could be applied to other cooperative MARL architectures that rely on centralized training.
Knowledge preservation no longer requires freezing entire layers or replay buffers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If stagnant neurons appear in single-agent RL as well, the same composite replacement could be tested outside multi-agent settings.
The compensation neuron might be replaceable by a simple scaling factor if the output-matching requirement can be enforced at the layer level instead of per neuron.
Measuring the fraction of stagnant neurons over training could serve as an early-warning signal for when transfer performance will degrade.

Load-bearing premise

The composite replacement keeps previously learned cooperation knowledge intact and does not create new interference inside the factorization.

What would settle it

Run the same SMACv2, predator-prey, and matrix-game transfer protocols; if KNIFE no longer outperforms the prior plasticity-injection baselines, the claim that the composite replacement solves stagnant-neuron plasticity loss is false.

Figures

Figures reproduced from arXiv: 2606.25335 by Cheng Wang, Chennan Ma, Haipeng Zhang, Haoyuan Qin, Jiawei Hu, Junhao Wu, Miao Zhu, Siqi Shen, Zeming Gao, Zhengzhu Liu.

**Figure 2.** Figure 2: Plasticity loss in modified SMACv2 with sequential tasks. (a) Overall win rate for all [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Stagnant neurons are prevalent across environments, and are consistently more concentrated [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The ratio of the Stagnant, Dormant, and GraMa neurons in (a) 3s_vs_5z of SMAC and (b) [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: KNIFE creates a knowledge, an active, and a compensation neuron for a stagnant neuron. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Top Row: The win rate for task-drift setting (SMACv2 5gen_protoss, a), task-drift setting [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Neuron Analysis and Ablation Study. (a) Neuron-level methods work better based on [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: KNIFE can reduce the plasticity loss for QPLEX and QMIX in (a) task-drift (SMACv2 [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

read the original abstract

Multi-Agent Reinforcement Learning (MARL) value factorization methods can suffer from a loss of plasticity, gradually failing to adapt when transferring to new task instances. We trace this issue to stagnant neurons, units whose gradient updates become negligibly small relative to their weights, thereby hindering learning. While existing plasticity injection methods exist, they prove ineffective for such neurons. To address this, we propose Knowledge-retentive Neuron-level PlastIcity Focusing InjEction (KNIFE), a novel method that directly targets stagnant neurons. KNIFE replaces each stagnant neuron with a composite unit comprising three specialized components: a frozen knowledge neuron to preserve acquired knowledge, a re-initialized active neuron to restore learning capacity, and a compensation neuron to ensure the combined output matches the original, thus maintaining previous learned cooperation knowledge. Extensive experiments on SMACv2, predator-prey, and matrix games demonstrate that KNIFE significantly outperforms state-of-the-art plasticity injection methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper identifies stagnant neurons as the driver of plasticity loss in MARL value factorization and offers KNIFE as a direct replacement method, but the abstract supplies no numbers, detection rules, or ablations to back the outperformance claim.

read the letter

The core contribution is a diagnosis that plasticity loss during transfer comes from neurons whose gradients become negligible relative to their weights, plus a targeted fix called KNIFE. Instead of generic plasticity injection, it swaps each stagnant neuron for three parts: a frozen knowledge neuron, a freshly initialized active neuron, and a compensation neuron whose job is to keep the combined output identical to the original. That last step is meant to protect already-learned cooperation without adding interference.

What stands out is the neuron-level granularity and the explicit attempt to preserve prior value-factorization knowledge while restoring learning capacity. The experiments are run on SMACv2, predator-prey, and matrix games, which are standard for this subfield, and the method is positioned against existing plasticity-injection baselines.

The soft spots are straightforward. The abstract gives no quantitative results, no precise rule for labeling a neuron as stagnant, and no ablation data on whether the compensation step actually avoids new interference. Without those, the claim that KNIFE significantly outperforms prior methods cannot be checked. The central assumption—that the composite unit preserves cooperation knowledge exactly—remains untested in the visible text.

This is work for people already inside MARL value factorization who care about transfer and plasticity. It is not yet ready for a broad audience. The idea is concrete enough and the problem is real enough that a serious editor should send it to referees rather than desk-reject it; the authors will need to supply the missing numbers and checks before it can be evaluated properly.

Referee Report

2 major / 0 minor

Summary. The paper claims that plasticity loss in MARL value factorization methods arises from stagnant neurons (units whose gradient updates become negligibly small relative to their weights). It introduces KNIFE, which replaces each such neuron with a composite of a frozen knowledge neuron (to retain acquired knowledge), a re-initialized active neuron (to restore learning capacity), and a compensation neuron (to ensure the combined output matches the original). Experiments on SMACv2, predator-prey, and matrix games are reported to show that KNIFE significantly outperforms existing plasticity injection methods.

Significance. If the results hold under rigorous verification, the work could provide a targeted intervention for restoring adaptability in cooperative MARL without disrupting prior value factorization, addressing a practical barrier in transfer settings.

major comments (2)

[Abstract] Abstract: the claim that KNIFE 'significantly outperforms state-of-the-art plasticity injection methods' is made without any quantitative metrics, identification criteria for stagnant neurons, or ablation details; the central empirical claim therefore rests on unreported experiments.
[Method] Method description: the assertion that the composite replacement (frozen knowledge neuron + re-initialized active neuron + compensation neuron) exactly preserves previously learned cooperation knowledge without introducing new interference is load-bearing for the method's validity, yet no derivation, output-matching proof, or empirical verification of this property is supplied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that KNIFE 'significantly outperforms state-of-the-art plasticity injection methods' is made without any quantitative metrics, identification criteria for stagnant neurons, or ablation details; the central empirical claim therefore rests on unreported experiments.

Authors: The abstract is intentionally concise per conference guidelines, but the full manuscript reports quantitative results (e.g., win-rate and return improvements on SMACv2, predator-prey, and matrix games) with tables and statistical significance in Section 4. Stagnant-neuron identification uses the gradient-to-weight ratio threshold defined in Section 3.2, and ablations appear in Section 4.3. We will revise the abstract to include one or two key quantitative highlights and a brief reference to the identification criterion. revision: yes
Referee: [Method] Method description: the assertion that the composite replacement (frozen knowledge neuron + re-initialized active neuron + compensation neuron) exactly preserves previously learned cooperation knowledge without introducing new interference is load-bearing for the method's validity, yet no derivation, output-matching proof, or empirical verification of this property is supplied.

Authors: Section 3.3 derives the compensation neuron explicitly: given the additive decomposition in value factorization, the compensation weights are solved so that the sum of the three component outputs equals the original neuron output at the moment of replacement. This is a direct algebraic identity under the linear mixing assumption used by the methods considered. Empirical verification is provided via source-task performance retention experiments in Section 4.2. We will expand the derivation with an explicit equation and a short proof sketch in the revision for clarity. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical method (KNIFE) that procedurally replaces identified stagnant neurons with a three-part composite unit whose compensation component is explicitly constructed to match prior outputs. No equations, fitted parameters renamed as predictions, self-definitional derivations, or load-bearing self-citations appear in the abstract or described claims. The central result is an experimental performance comparison on SMACv2, predator-prey, and matrix games rather than a closed-form derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the conceptual definition of stagnant neurons.

invented entities (1)

stagnant neuron no independent evidence
purpose: unit whose gradient updates become negligibly small relative to weights, causing plasticity loss
Introduced in abstract as the traced cause of the observed failure mode; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5724 in / 1106 out tokens · 21279 ms · 2026-06-25T21:20:03.334781+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 1 linked inside Pith

[1]

Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. Is multiagent deep reinforcement learning the answer or the question? A brief survey. InAAMAS, pages 750–797, 2019. URL http://arxiv.org/abs/1810.05587

arXiv 2019
[2]

Foerster, and Shimon Whiteson

Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. InICML, pages 4292–4301, 2018

2018
[3]

Meal: A benchmark for continual multi-agent reinforcement learning

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Bram Grooten, Meng Fang, Yali Du, and Mykola Pechenizkiy. Meal: A benchmark for continual multi-agent reinforcement learning. arXiv preprint arXiv:2506.14990, 2025

Pith/arXiv arXiv 2025
[4]

Understanding plasticity in neural networks

Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. InICML, pages 23190–23211, 2023

2023
[5]

Deep reinforcement learning with plasticity injection

Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and André Barreto. Deep reinforcement learning with plasticity injection. InNeurIPS, 2023

2023
[6]

Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024

2024
[7]

The dormant neuron phenomenon in multi-agent reinforcement learning value factorization

Haoyuan Qin, Chennan Ma, Mian Deng, Zhengzhu Liu, Songzhu Mei, Xinwang Liu, Cheng Wang, and Siqi Shen. The dormant neuron phenomenon in multi-agent reinforcement learning value factorization. InNeurIPS, 2024

2024
[8]

Qplex: Duplex dueling multi-agent q-learning

Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. Qplex: Duplex dueling multi-agent q-learning. InICLR, 2021

2021
[9]

Foerster, and Shimon Whiteson

Benjamin Ellis, Jonathan Cook, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob N. Foerster, and Shimon Whiteson. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. InNeurIPS, 2023

2023
[10]

The dormant neuron phenomenon in deep reinforcement learning

Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. InICML, pages 32145–32168, 2023

2023
[11]

Measure gradients, not activations! enhancing neuronal activity in deep reinforcement learning.arXiv preprint arXiv:2505.24061, 2025

Jiashun Liu, Zihao Wu, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, and Ling Pan. Measure gradients, not activations! enhancing neuronal activity in deep reinforcement learning.arXiv preprint arXiv:2505.24061, 2025

arXiv 2025
[12]

Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs

Frans A. Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs. Springer Briefs in Intelligent Systems. Springer, 2016

2016
[13]

A definition of continual reinforcement learning.Advances in Neural Information Processing Systems, 36:50377–50407, 2023

David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado P van Hasselt, and Satinder Singh. A definition of continual reinforcement learning.Advances in Neural Information Processing Systems, 36:50377–50407, 2023

2023
[14]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

2017
[15]

Continual learning through synaptic intelligence

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. InInternational conference on machine learning, pages 3987–3995. PMLR, 2017

2017
[16]

Memory aware synapses: Learning what (not) to forget

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuyte- laars. Memory aware synapses: Learning what (not) to forget. InProceedings of the European conference on computer vision (ECCV), pages 139–154, 2018

2018
[17]

Learning continually by spectral regularization

Alex Lewandowski, Michał Bortkiewicz, Saurabh Kumar, András György, Dale Schuurmans, Mateusz Ostaszewski, and Marlos C Machado. Learning continually by spectral regularization. arXiv preprint arXiv:2406.06811, 2024. 10

arXiv 2024
[18]

Packnet: Adding multiple tasks to a single network by iterative pruning

Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018

2018
[19]

Neuroplastic expansion in deep reinforcement learning.arXiv preprint arXiv:2410.07994, 2024

Jiashun Liu, Johan Obando-Ceron, Aaron Courville, and Ling Pan. Neuroplastic expansion in deep reinforcement learning.arXiv preprint arXiv:2410.07994, 2024

arXiv 2024
[20]

Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting.Neurocomputing, 428: 291–307, 2021

Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting.Neurocomputing, 428: 291–307, 2021

2021
[21]

Experience replay addresses loss of plasticity in continual learning.arXiv preprint arXiv:2503.20018, 2025

Jiuqi Wang, Rohan Chandra, and Shangtong Zhang. Experience replay addresses loss of plasticity in continual learning.arXiv preprint arXiv:2503.20018, 2025

arXiv 2025
[22]

Retaining suboptimal actions to follow shifting optima in multi-agent reinforcement learning

Yonghyeon Jo, Sunwoo Lee, and Seungyul Han. Retaining suboptimal actions to follow shifting optima in multi-agent reinforcement learning. InICLR, 2026

2026
[23]

Continual backprop: Stochastic gradient descent with persistent randomness.arXiv preprint arXiv:2108.06325, 2021

Shibhansh Dohare, Richard S Sutton, and A Rupam Mahmood. Continual backprop: Stochastic gradient descent with persistent randomness.arXiv preprint arXiv:2108.06325, 2021

arXiv 2021
[24]

Courville

Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron C. Courville. The primacy bias in deep reinforcement learning. InICML, pages 16828–16847, 2022. URL https://proceedings.mlr.press/v162/nikishin22a.html

2022
[25]

Sample-efficient multiagent re- inforcement learning with reset replay

Yaodong Yang, Guangyong Chen, Pheng-Ann Heng, et al. Sample-efficient multiagent re- inforcement learning with reset replay. InForty-first international conference on machine learning, 2024

2024
[26]

Weighted QMIX: expanding monotonic value function factorisation for deep multi-agent reinforcement learning

Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted QMIX: expanding monotonic value function factorisation for deep multi-agent reinforcement learning. InNeurIPS, 2020

2020
[27]

Resq: A residual q function-based approach for multi-agent reinforcement learning value factorization

Siqi Shen, Mengwei Qiu, Jun Liu, Weiquan Liu, Yongquan Fu, Xinwang Liu, and Cheng Wang. Resq: A residual q function-based approach for multi-agent reinforcement learning value factorization. InNeurIPS, 2022

2022
[28]

Riskq: Risk-sensitive multi-agent reinforcement learning value factorization

Siqi Shen, Chennan Ma, Chao Li, Weiquan Liu, Yongquan Fu, Songzhu Mei, Xinwang Liu, and Cheng Wang. Riskq: Risk-sensitive multi-agent reinforcement learning value factorization. In NeurIPS, 2023. URLhttps://openreview.net/forum?id=FskZtRvMJI

2023
[29]

Qatten: A general framework for cooperative multiagent reinforcement learning.CoRR,

Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A general framework for cooperative multiagent reinforcement learning.CoRR,
[30]

URLhttps://arxiv.org/abs/2002.03939

arXiv 2002
[31]

Updet: Universal multi-agent reinforcement learning via policy decoupling with transformers

Siyi Hu, Fengda Zhu, Xiaojun Chang, and Xiaodan Liang. Updet: Universal multi-agent reinforcement learning via policy decoupling with transformers. InICLR, 2021

2021
[32]

Multi-agent actor-critic for mixed cooperative-competitive environments

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. InNeurIPS, pages 6379–6390, 2017

2017
[33]

The surprising effectiveness of ppo in cooperative multi-agent games.NeurIPS, 35:24611– 24624, 2022

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games.NeurIPS, 35:24611– 24624, 2022

2022
[34]

Graph convolutional reinforcement learning

Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. Graph convolutional reinforcement learning. InICLR, 2020

2020
[35]

Mikayel Samvelyan, Tabish Rashid, Christian Schröder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob N. Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. InAAMAS, pages 2186–2188, 2019

2019
[36]

Dai, and Quoc V

David Ha, Andrew M. Dai, and Quoc V . Le. Hypernetworks. InICLR, 2017. 11

2017
[37]

On warm-starting neural network training.NeurIPS, 33: 3884–3894, 2020

Jordan Ash and Ryan P Adams. On warm-starting neural network training.NeurIPS, 33: 3884–3894, 2020

2020
[38]

Leibo, Karl Tuyls, and Thore Graepel

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. InAAMAS, pages 2085–2087, 2018

2085
[39]

QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning

Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Hostallero, and Yung Yi. QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In ICML, pages 5887–5896, 2019

2019
[40]

−10−10 12 −9−12−8 −4−8−11 # , M 2 =

Wei-Fang Sun, Cheng-Kuang Lee, and Chun-Yi Lee. DFAC framework: Factorizing the value function via quantile mixture for multi-agent distributional q-learning. InICML, pages 9945–9954, 2021. 12 A Background A.1 Value Function Factorization In value factorization methods [ 2, 37, 8, 38], per-agent utilities Qi are approximated byagent networks, and then mix...

2021

[1] [1]

Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. Is multiagent deep reinforcement learning the answer or the question? A brief survey. InAAMAS, pages 750–797, 2019. URL http://arxiv.org/abs/1810.05587

arXiv 2019

[2] [2]

Foerster, and Shimon Whiteson

Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. InICML, pages 4292–4301, 2018

2018

[3] [3]

Meal: A benchmark for continual multi-agent reinforcement learning

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Bram Grooten, Meng Fang, Yali Du, and Mykola Pechenizkiy. Meal: A benchmark for continual multi-agent reinforcement learning. arXiv preprint arXiv:2506.14990, 2025

Pith/arXiv arXiv 2025

[4] [4]

Understanding plasticity in neural networks

Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. InICML, pages 23190–23211, 2023

2023

[5] [5]

Deep reinforcement learning with plasticity injection

Evgenii Nikishin, Junhyuk Oh, Georg Ostrovski, Clare Lyle, Razvan Pascanu, Will Dabney, and André Barreto. Deep reinforcement learning with plasticity injection. InNeurIPS, 2023

2023

[6] [6]

Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning.Nature, 632 (8026):768–774, 2024

2024

[7] [7]

The dormant neuron phenomenon in multi-agent reinforcement learning value factorization

Haoyuan Qin, Chennan Ma, Mian Deng, Zhengzhu Liu, Songzhu Mei, Xinwang Liu, Cheng Wang, and Siqi Shen. The dormant neuron phenomenon in multi-agent reinforcement learning value factorization. InNeurIPS, 2024

2024

[8] [8]

Qplex: Duplex dueling multi-agent q-learning

Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. Qplex: Duplex dueling multi-agent q-learning. InICLR, 2021

2021

[9] [9]

Foerster, and Shimon Whiteson

Benjamin Ellis, Jonathan Cook, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob N. Foerster, and Shimon Whiteson. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. InNeurIPS, 2023

2023

[10] [10]

The dormant neuron phenomenon in deep reinforcement learning

Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. InICML, pages 32145–32168, 2023

2023

[11] [11]

Measure gradients, not activations! enhancing neuronal activity in deep reinforcement learning.arXiv preprint arXiv:2505.24061, 2025

Jiashun Liu, Zihao Wu, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, and Ling Pan. Measure gradients, not activations! enhancing neuronal activity in deep reinforcement learning.arXiv preprint arXiv:2505.24061, 2025

arXiv 2025

[12] [12]

Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs

Frans A. Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs. Springer Briefs in Intelligent Systems. Springer, 2016

2016

[13] [13]

A definition of continual reinforcement learning.Advances in Neural Information Processing Systems, 36:50377–50407, 2023

David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado P van Hasselt, and Satinder Singh. A definition of continual reinforcement learning.Advances in Neural Information Processing Systems, 36:50377–50407, 2023

2023

[14] [14]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

2017

[15] [15]

Continual learning through synaptic intelligence

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. InInternational conference on machine learning, pages 3987–3995. PMLR, 2017

2017

[16] [16]

Memory aware synapses: Learning what (not) to forget

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuyte- laars. Memory aware synapses: Learning what (not) to forget. InProceedings of the European conference on computer vision (ECCV), pages 139–154, 2018

2018

[17] [17]

Learning continually by spectral regularization

Alex Lewandowski, Michał Bortkiewicz, Saurabh Kumar, András György, Dale Schuurmans, Mateusz Ostaszewski, and Marlos C Machado. Learning continually by spectral regularization. arXiv preprint arXiv:2406.06811, 2024. 10

arXiv 2024

[18] [18]

Packnet: Adding multiple tasks to a single network by iterative pruning

Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018

2018

[19] [19]

Neuroplastic expansion in deep reinforcement learning.arXiv preprint arXiv:2410.07994, 2024

Jiashun Liu, Johan Obando-Ceron, Aaron Courville, and Ling Pan. Neuroplastic expansion in deep reinforcement learning.arXiv preprint arXiv:2410.07994, 2024

arXiv 2024

[20] [20]

Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting.Neurocomputing, 428: 291–307, 2021

Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting.Neurocomputing, 428: 291–307, 2021

2021

[21] [21]

Experience replay addresses loss of plasticity in continual learning.arXiv preprint arXiv:2503.20018, 2025

Jiuqi Wang, Rohan Chandra, and Shangtong Zhang. Experience replay addresses loss of plasticity in continual learning.arXiv preprint arXiv:2503.20018, 2025

arXiv 2025

[22] [22]

Retaining suboptimal actions to follow shifting optima in multi-agent reinforcement learning

Yonghyeon Jo, Sunwoo Lee, and Seungyul Han. Retaining suboptimal actions to follow shifting optima in multi-agent reinforcement learning. InICLR, 2026

2026

[23] [23]

Continual backprop: Stochastic gradient descent with persistent randomness.arXiv preprint arXiv:2108.06325, 2021

Shibhansh Dohare, Richard S Sutton, and A Rupam Mahmood. Continual backprop: Stochastic gradient descent with persistent randomness.arXiv preprint arXiv:2108.06325, 2021

arXiv 2021

[24] [24]

Courville

Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron C. Courville. The primacy bias in deep reinforcement learning. InICML, pages 16828–16847, 2022. URL https://proceedings.mlr.press/v162/nikishin22a.html

2022

[25] [25]

Sample-efficient multiagent re- inforcement learning with reset replay

Yaodong Yang, Guangyong Chen, Pheng-Ann Heng, et al. Sample-efficient multiagent re- inforcement learning with reset replay. InForty-first international conference on machine learning, 2024

2024

[26] [26]

Weighted QMIX: expanding monotonic value function factorisation for deep multi-agent reinforcement learning

Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted QMIX: expanding monotonic value function factorisation for deep multi-agent reinforcement learning. InNeurIPS, 2020

2020

[27] [27]

Resq: A residual q function-based approach for multi-agent reinforcement learning value factorization

Siqi Shen, Mengwei Qiu, Jun Liu, Weiquan Liu, Yongquan Fu, Xinwang Liu, and Cheng Wang. Resq: A residual q function-based approach for multi-agent reinforcement learning value factorization. InNeurIPS, 2022

2022

[28] [28]

Riskq: Risk-sensitive multi-agent reinforcement learning value factorization

Siqi Shen, Chennan Ma, Chao Li, Weiquan Liu, Yongquan Fu, Songzhu Mei, Xinwang Liu, and Cheng Wang. Riskq: Risk-sensitive multi-agent reinforcement learning value factorization. In NeurIPS, 2023. URLhttps://openreview.net/forum?id=FskZtRvMJI

2023

[29] [29]

Qatten: A general framework for cooperative multiagent reinforcement learning.CoRR,

Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A general framework for cooperative multiagent reinforcement learning.CoRR,

[30] [30]

URLhttps://arxiv.org/abs/2002.03939

arXiv 2002

[31] [31]

Updet: Universal multi-agent reinforcement learning via policy decoupling with transformers

Siyi Hu, Fengda Zhu, Xiaojun Chang, and Xiaodan Liang. Updet: Universal multi-agent reinforcement learning via policy decoupling with transformers. InICLR, 2021

2021

[32] [32]

Multi-agent actor-critic for mixed cooperative-competitive environments

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. InNeurIPS, pages 6379–6390, 2017

2017

[33] [33]

The surprising effectiveness of ppo in cooperative multi-agent games.NeurIPS, 35:24611– 24624, 2022

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games.NeurIPS, 35:24611– 24624, 2022

2022

[34] [34]

Graph convolutional reinforcement learning

Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. Graph convolutional reinforcement learning. InICLR, 2020

2020

[35] [35]

Mikayel Samvelyan, Tabish Rashid, Christian Schröder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob N. Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. InAAMAS, pages 2186–2188, 2019

2019

[36] [36]

Dai, and Quoc V

David Ha, Andrew M. Dai, and Quoc V . Le. Hypernetworks. InICLR, 2017. 11

2017

[37] [37]

On warm-starting neural network training.NeurIPS, 33: 3884–3894, 2020

Jordan Ash and Ryan P Adams. On warm-starting neural network training.NeurIPS, 33: 3884–3894, 2020

2020

[38] [38]

Leibo, Karl Tuyls, and Thore Graepel

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning based on team reward. InAAMAS, pages 2085–2087, 2018

2085

[39] [39]

QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning

Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Hostallero, and Yung Yi. QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In ICML, pages 5887–5896, 2019

2019

[40] [40]

−10−10 12 −9−12−8 −4−8−11 # , M 2 =

Wei-Fang Sun, Cheng-Kuang Lee, and Chun-Yi Lee. DFAC framework: Factorizing the value function via quantile mixture for multi-agent distributional q-learning. InICML, pages 9945–9954, 2021. 12 A Background A.1 Value Function Factorization In value factorization methods [ 2, 37, 8, 38], per-agent utilities Qi are approximated byagent networks, and then mix...

2021