Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage

Biao Zhao; Hongyang Du; Jinhu Qi; Junli Wang; Shixiang Tang; Weiqiang Jin; Wentao Zhang; Yang Liu

arxiv: 2506.07548 · v2 · submitted 2025-06-09 · 💻 cs.AI · cs.RO

Weiqiang Jin, Yang Liu, Shixiang Tang, Jinhu Qi, Wentao Zhang, Junli Wang, Biao Zhao, Hongyang Du

Pith reviewed 2026-05-19 11:05 UTC · model grok-4.3

classification 💻 cs.AI cs.RO

keywords multi-agent reinforcement learning, curriculum learning, adaptive difficulty, counterfactual advantage, StarCraft Multi-Agent Challenge, environmental meta-stationarity, group relative policy optimization

The pith

Adaptive curriculum driven by win-rate signals and counterfactual group advantages overcomes fixed-difficulty limits in multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that training multi-agent reinforcement learning agents against opponents at a single fixed difficulty creates environmental meta-stationarity, which restricts how well policies generalize and pushes learning into shallow local optima. To address this, the authors introduce CL-MARL, a framework that dynamically adjusts the difficulty of the task online based on the agents' win rates. This is paired with a new optimization method called Counterfactual Group Relative Policy Advantage that helps separate each agent's contribution despite the changing team dynamics. If correct, this would allow agents to build skills more progressively and achieve better performance on challenging cooperative tasks like those in StarCraft.

Core claim

The paper claims that static-difficulty training in MARL creates environmental meta-stationarity, which caps generalization and steers learning toward shallow optima. It proposes to overcome this with CL-MARL, which adapts opponent strength online from win-rate signals through the FlexDiff scheduler (momentum-based trend estimation fused with sliding-window dual-curve monitoring), and with the Counterfactual Group Relative Policy Advantage, which disentangles agent contributions under shifting dynamics.

What carries the argument

The FlexDiff scheduler for adaptive difficulty transitions based on win-rate signals together with the Counterfactual Group Relative Policy Advantage (CGRPA) for optimization in non-stationary team settings.
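
As a rough illustration of how a win-rate-driven scheduler of this kind could work, the sketch below combines an exponential moving average of the win rate (a stand-in for the momentum-based trend estimate) with two sliding windows over training and evaluation returns. It is not the authors' FlexDiff implementation: the class name, the number of levels, the thresholds, the window size, and the promote/demote rule are all assumptions made for illustration.

```python
from collections import deque


class WinRateScheduler:
    """Illustrative difficulty scheduler; NOT the paper's FlexDiff code.

    Every default value and the promote/demote rule are hypothetical.
    """

    def __init__(self, levels=7, beta=0.9, window=20, up=0.7, down=0.2, max_gap=2.0):
        self.levels = levels      # number of discrete difficulty levels (assumed)
        self.beta = beta          # momentum coefficient of the win-rate average
        self.up = up              # promote when the smoothed win rate exceeds this
        self.down = down          # demote when the smoothed win rate drops below this
        self.max_gap = max_gap    # tolerated train/eval return gap before promoting
        self.level = 0
        self.ema = None           # smoothed win rate at the current level
        self.train = deque(maxlen=window)  # sliding window of training returns
        self.evals = deque(maxlen=window)  # sliding window of evaluation returns

    def update(self, won, train_return, eval_return):
        """Record one episode and return the (possibly changed) difficulty level."""
        x = 1.0 if won else 0.0
        self.ema = x if self.ema is None else self.beta * self.ema + (1 - self.beta) * x
        self.train.append(train_return)
        self.evals.append(eval_return)
        if len(self.train) < self.train.maxlen:
            return self.level     # wait until both windows are full
        # Dual-curve check: mean training return versus mean evaluation return.
        gap = abs(sum(self.train) - sum(self.evals)) / len(self.train)
        if self.ema > self.up and gap < self.max_gap and self.level < self.levels - 1:
            self._move(+1)
        elif self.ema < self.down and self.level > 0:
            self._move(-1)
        return self.level

    def _move(self, step):
        # Reset statistics so the new level is judged on its own episodes.
        self.level += step
        self.ema = None
        self.train.clear()
        self.evals.clear()
```

Resetting the statistics after each transition is one simple way to keep a single lucky streak from triggering consecutive promotions; whether the paper does something similar is not stated in the abstract.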

If this is right

CL-MARL achieves a 40% mean win rate on super-hard SMAC maps with an average episode return of 17.85.
It outperforms QMIX, OW-QMIX, DER, EMC, and MARR baselines by an average of +2.94.
Peak win rates are reached approximately 1.28 times faster on the 8m_vs_9m map and 1.42 times faster on the 3s5z_vs_3s6z map compared to the strongest baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This adaptive approach could be tested in other MARL environments where difficulty can be parameterized to see if similar gains in generalization occur.
The reliance on win rates suggests that alternative performance metrics might be needed for tasks without binary win/loss outcomes.
Extending the counterfactual mechanism might help address non-stationarity in single-agent RL with changing environments as well.

Load-bearing premise

Win-rate signals provide a sufficiently clean and timely indicator of agent mastery that can be used to drive stable difficulty transitions without introducing excessive noise or requiring per-map manual tuning.

What would settle it

Running the method on the super-hard SMAC maps and finding that the difficulty level oscillates frequently without leading to improved final win rates over fixed-difficulty training would indicate that the adaptive curriculum does not reliably overcome meta-stationarity.
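
One way to make "oscillates frequently" measurable is to count direction reversals in the logged difficulty trace. The sketch below assumes the scheduler logs one integer difficulty level per evaluation interval; the function name and both traces are hypothetical.

```python
def count_reversals(levels):
    """Count how often the difficulty trace switches between rising and falling."""
    reversals = 0
    last_direction = 0            # +1 rising, -1 falling, 0 not yet moved
    for prev, cur in zip(levels, levels[1:]):
        if cur == prev:
            continue              # plateaus do not count as a change of direction
        direction = 1 if cur > prev else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


steady = [0, 1, 1, 2, 3, 3, 4]         # monotone progression (made-up trace)
jittery = [0, 1, 0, 1, 2, 1, 2, 1]     # repeated promote/demote cycles (made-up trace)
print(count_reversals(steady), count_reversals(jittery))
```

A high reversal count combined with no gain in final win rate over fixed-difficulty training would be the pattern described above.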

Figures

Figures reproduced from arXiv: 2506.07548 by Biao Zhao, Hongyang Du, Jinhu Qi, Junli Wang, Shixiang Tang, Weiqiang Jin, Wentao Zhang, Yang Liu.

**Figure 1.** When CL Meets MARL: A comparative illustration of training paradigms between CL-based supervised learning and CL-based MARL. Part 1 and Part 2 represent the traditional CL framework for supervised learning (e.g., CV, NLP) with labeled data and its optimization trajectories (supervision loss minimization), and Part 3 and Part 4 are the CL-based MARL training framework in the SMAC adversarial environment (2s…

**Figure 2.** The architecture of our proposed CL-based MARL framework for cooperative-adversarial scenarios in SMAC. The…

**Figure 3.** The detailed visualization of our proposed CGRPA…

**Figure 4.** The typical SMAC benchmark scenarios and maps…

**Figure 5.** Baseline Comparisons of the six easy maps in the SMAC benchmark. The x-axis represents the global training steps,…

**Figure 6.** Baseline Comparisons of the eight maps of…

**Figure 7.** Baseline Comparisons of the eight maps of…

**Figure 8.** Performance Comparison of Convergence Risks In…

**Figure 9.** Comparative Analysis on Normalized Win Rates of…

**Figure 11-12.** It was evident that, under the extended simulation…

**Figure 11.** Example of the Remaining Survival Status of Units…

**Figure 12.** Example of the Remaining Survival Status of Units in…

Original abstract

Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28faster on 8m_vs_9m and 1.42 faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CL-MARL tries an online adaptive curriculum for MARL on SMAC using win-rate signals plus a counterfactual group advantage fix, but the scheduler stability on sparse rewards looks like the part to watch.

The main point is that this work replaces fixed-difficulty training in cooperative MARL with an online scheduler that raises or lowers opponent strength based on observed win rates and returns. They pair it with a new advantage estimator to keep credit assignment workable when the environment keeps shifting. On the super-hard SMAC maps they report a 40% mean win rate and faster peak performance than QMIX and a few other baselines, with code released on GitHub. That combination of adaptive curriculum and the counterfactual extension of group-relative optimization is the concrete addition relative to the baselines they cite.

The scheduler itself, FlexDiff, uses momentum tracking and dual-curve monitoring of training versus evaluation curves, which is a reasonable engineering choice for avoiding manual per-map tuning. The empirical numbers are stated plainly and the speed-up claims are specific enough to be checked. The central practical claim holds up in the reported results: moving away from static difficulty does appear to improve final performance and convergence speed on these maps.

The soft spot is exactly the one the stress-test note flags. On maps where average win rate sits around 40%, episode outcomes are mostly losses or near-zero returns, so any sliding-window signal will carry high variance and long zero stretches. Nothing in the abstract shows variance plots for the scheduler, sensitivity checks, or failure cases where the difficulty oscillates or stalls. If that happens, the method could re-introduce the meta-stationarity it aims to remove. The abstract also omits run counts, standard errors, and ablation results on the scheduler alone, which leaves the strength of the central claim harder to judge from the summary.

This paper is aimed at MARL researchers working on generalization, curricula, or multi-robot coordination tasks. Readers who care about practical training tricks on SMAC-style benchmarks will find usable ideas here even if they end up modifying the scheduler. It is coherent on its own terms and shows honest engagement with the non-stationarity issue, so it clears the bar for a serious referee. I would send it out for review rather than desk-reject; the empirical direction is worth testing and the reviewers can press on the variance and ablation gaps.
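
The variance concern raised in the letter can be sized with a back-of-the-envelope calculation. If episode outcomes are treated as independent Bernoulli trials with success probability 0.4 (an idealisation, since episodes within a training run are correlated and the policy keeps changing), the standard error of a window-averaged win rate is sqrt(p(1 - p) / n). The window sizes below are arbitrary examples, not values from the paper.

```python
import math


def win_rate_standard_error(p, window):
    """Standard error of the mean of `window` i.i.d. Bernoulli(p) outcomes."""
    return math.sqrt(p * (1.0 - p) / window)


for window in (10, 50, 200):
    se = win_rate_standard_error(0.4, window)
    print(f"window={window:3d}  standard error={se:.3f}")
```

Under these assumptions a 10-episode window has a standard error of roughly 0.15, so a measured 40% win rate is hard to tell apart from 25% or 55%. A 200-episode window narrows this to about 0.035, at the cost of a slower reaction to real changes in skill.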

Referee Report

2 major / 2 minor

Summary. The manuscript claims that fixed-difficulty training in MARL induces environmental meta-stationarity, which limits generalization and leads to shallow optima. It proposes CL-MARL, which uses the FlexDiff scheduler to dynamically adjust opponent difficulty from win-rate and return signals via momentum estimation and dual-curve sliding-window monitoring, together with the Counterfactual Group Relative Policy Advantage (CGRPA) estimator to mitigate the resulting non-stationarity. On SMAC super-hard maps the method is reported to reach a 40% mean win rate and average return of 17.85, outperforming QMIX, OW-QMIX, DER, EMC and MARR by +2.94 on average while attaining peak performance 1.28–1.42 times faster on two maps.

Significance. If the reported gains prove robust, the work would be significant for practical MARL by showing that online curriculum adaptation can improve both final performance and sample efficiency on cooperative tasks against scripted opponents. The public implementation strengthens reproducibility. The combination of a scheduler driven by sparse binary signals and a counterfactual group advantage estimator addresses a concrete training pathology, though the magnitude of improvement remains modest relative to the added complexity.

major comments (2)

[Abstract / §4] Abstract and experimental results: the concrete performance claims (40% win rate, +2.94 average improvement, 1.28× and 1.42× speed-ups) are presented without any indication of run count, standard deviation, confidence intervals or statistical tests. This information is load-bearing for the central empirical claim that CL-MARL exceeds the listed baselines.
[§3.2] §3.2 (FlexDiff scheduler): the claim that momentum-based trend estimation plus dual-curve monitoring yields “stable difficulty transitions without manual tuning” is central to overcoming meta-stationarity, yet no variance plots, sensitivity analysis to window size, or failure-case statistics are supplied for maps whose mean win rate is only ~40%. On such maps individual episodes produce sparse binary signals, raising the risk that the scheduler oscillates or stalls exactly as the skeptic note anticipates.

minor comments (2)

[Abstract] Abstract: “1.28faster” is missing a space and is inconsistent with the subsequent “1.42 faster”; standardize phrasing.
[§3.3] Notation for CGRPA: the extension of GRPO with counterfactual baselines would benefit from an explicit equation defining the advantage estimator before the empirical section.
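
To make the last request concrete, the sketch below shows one plausible shape such an estimator could take: a COMA-style counterfactual baseline per agent, followed by GRPO-style centring and scaling across the group. This is a guess at the general structure, not the paper's definition of CGRPA. Treating the agents of one team as the "group", the function names, and the toy numbers are all assumptions.

```python
import statistics


def counterfactual_advantage(q_values, policy, taken):
    """COMA-style advantage for one agent.

    q_values[a]: joint-action value if this agent played action a while
                 the other agents' actions stay fixed.
    policy[a]:   probability the agent assigns to action a.
    taken:       index of the action the agent actually played.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[taken] - baseline


def group_relative(advantages, eps=1e-8):
    """GRPO-style normalisation: centre and scale within the group."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


# Toy example with three agents and two actions each (made-up numbers).
q = [[1.0, 3.0], [2.0, 2.0], [4.0, 0.0]]
pi = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
actions = [1, 0, 1]
raw = [counterfactual_advantage(q[i], pi[i], actions[i]) for i in range(3)]
print(raw, group_relative(raw))
```

In this toy case the first agent's action beats its own counterfactual baseline, the second agent's choice makes no difference, and the third agent's action is worse than its baseline; the normalisation then expresses those contributions relative to the team.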

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the empirical support and analysis in the manuscript.

Referee: [Abstract / §4] Abstract and experimental results: the concrete performance claims (40% win rate, +2.94 average improvement, 1.28× and 1.42× speed-ups) are presented without any indication of run count, standard deviation, confidence intervals or statistical tests. This information is load-bearing for the central empirical claim that CL-MARL exceeds the listed baselines.

Authors: We agree that statistical rigor is necessary to substantiate the reported performance gains. In the revised manuscript we will explicitly state that all results are averaged over 5 independent random seeds, report standard deviations alongside the mean win rates and returns, include 95% confidence intervals, and add paired t-test p-values comparing CL-MARL against each baseline. These details will be added to the abstract, Section 4, and the corresponding tables and figures.
revision: yes

Referee: [§3.2] §3.2 (FlexDiff scheduler): the claim that momentum-based trend estimation plus dual-curve monitoring yields “stable difficulty transitions without manual tuning” is central to overcoming meta-stationarity, yet no variance plots, sensitivity analysis to window size, or failure-case statistics are supplied for maps whose mean win rate is only ~40%. On such maps individual episodes produce sparse binary signals, raising the risk that the scheduler oscillates or stalls exactly as the skeptic note anticipates.

Authors: We acknowledge that the current manuscript lacks explicit stability diagnostics for FlexDiff under sparse win-rate signals. In the revision we will add (i) variance plots of the estimated difficulty level and momentum trend over training for the super-hard maps, (ii) a sensitivity table varying the sliding-window sizes and momentum coefficient, and (iii) a short discussion of any observed oscillations or stalls together with the conditions under which they were mitigated. These additions will directly address the concern about robustness on maps with ~40% mean win rate.
revision: yes
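
The statistics promised in the first response (mean, standard deviation, 95% confidence interval, paired t-test over 5 seeds) can be sketched with the standard library alone. The win rates below are invented placeholders, not results from the paper, and the helper names are hypothetical. The constant 2.776 is the two-sided 95% critical value of Student's t with 4 degrees of freedom, which matches 5 seeds; an exact p-value would need a t-distribution CDF from a statistics package.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom


def mean_ci(values, t_crit=T_CRIT_DF4):
    """Mean and 95% confidence half-width for a small sample."""
    mean = statistics.fmean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


def paired_t_statistic(a, b):
    """t statistic for paired samples (the same seeds used for both methods)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Made-up win rates for five seeds; these are NOT results from the paper.
method = [0.42, 0.38, 0.45, 0.36, 0.39]
baseline = [0.30, 0.33, 0.35, 0.28, 0.34]
mean, hw = mean_ci(method)
t = paired_t_statistic(method, baseline)
print(f"mean={mean:.3f} ± {hw:.3f}, paired t={t:.2f}, significant={abs(t) > T_CRIT_DF4}")
```

Pairing by seed only makes sense if the method and the baseline are actually run with matching seeds; otherwise an unpaired test is the appropriate choice.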

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical outcomes rather than self-referential derivations

The paper defines environmental meta-stationarity as a new term for fixed-difficulty training, then proposes CL-MARL with FlexDiff (momentum + dual-curve win-rate monitoring) and CGRPA (counterfactual group-relative advantage). These are algorithmic constructions whose performance is reported as measured results on SMAC super-hard maps (40% mean win rate, +2.94 average return over baselines). No equation or scheduler step is shown to equal its own input by construction, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to force the central result. The reported speed-ups and win-rate gains are presented as experimental outcomes of the scheduler and estimator, not as identities derived from the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The method rests on standard MARL assumptions plus two new algorithmic pieces whose internal parameters and stability properties are not detailed in the abstract.

axioms (1)

domain assumption: Win rate serves as a reliable proxy for mastery that can drive difficulty changes without destabilizing learning.
Invoked to justify the FlexDiff scheduler decisions.

invented entities (2)

FlexDiff scheduler (no independent evidence)
purpose: Online adaptation of opponent strength from win-rate signals
Fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns.

CGRPA (no independent evidence)
purpose: Disentangle individual agent contributions under non-stationary team dynamics
Extends GRPO-style group-relative optimization with counterfactual baselines.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: FlexDiff fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · tag: unclear
Paper passage: CGRPA constructs a counterfactual advantage function that isolates individual contributions within group behavior

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 5 internal anchors

[1]

A survey on multi-agent reinforcement learning and its application,

Z. Ning and L. Xie, “A survey on multi-agent reinforcement learning and its application,” Journal of Automation and Intelligence , vol. 3, no. 2, pp. 73–91, 2024

work page 2024
[3]

A comprehensive survey on multi-agent reinforcement learning for connected and automated vehicles,

P. Yadav, A. Mishra, and S. Kim, “A comprehensive survey on multi-agent reinforcement learning for connected and automated vehicles,” Sensors, vol. 23, no. 10, 2023. [Online]. Available: https://www.mdpi.com/1424-8220/23/10/4710

work page 2023
[4]

Marlens: Understanding multi-agent reinforcement learning for traffic signal control via visual analytics,

Y . Zhang, G. Zheng, Z. Liu, Q. Li, and H. Zeng, “Marlens: Understanding multi-agent reinforcement learning for traffic signal control via visual analytics,” IEEE Transactions on Visualization and Computer Graphics , p. 1–16, 2024. [Online]. Available: http: //dx.doi.org/10.1109/TVCG.2024.3392587

work page doi:10.1109/tvcg.2024.3392587 2024
[5]

Multi-Agent Reinforcement Learning for Autonomous Driving: A Survey,

R. Zhang, J. Hou, F. Walter, S. Gu, J. Guan, F. R ¨ohrbein, Y . Du, P. Cai, G. Chen, and A. Knoll, “Multi-agent reinforcement learning for autonomous driving: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2408.09675

work page arXiv 2024
[6]

A survey of multi-agent deep reinforcement learning with communication,

C. Zhu, M. Dastani, and S. Wang, “A survey of multi-agent deep reinforcement learning with communication,” Autonomous Agents and Multi-Agent Systems , vol. 38, no. 1, p. 4, 2024. [Online]. Available: https://doi.org/10.1007/s10458-023-09633-6

work page doi:10.1007/s10458-023-09633-6 2024
[7]

Deep reinforcement learning: A survey,

X. Wang, S. Wang, X. Liang, D. Zhao, J. Huang, X. Xu, B. Dai, and Q. Miao, “Deep reinforcement learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems , vol. 35, no. 4, pp. 5064– 5078, 2024

work page 2024
[8]

Weighted qmix: expanding monotonic value function factorisation for deep multi-agent reinforcement learning,

T. Rashid, G. Farquhar, B. Peng, and S. Whiteson, “Weighted qmix: expanding monotonic value function factorisation for deep multi-agent reinforcement learning,” in Proceedings of the 34th International Con- ference on Neural Information Processing Systems , ser. NIPS ’20. Red Hook, NY , USA: Curran Associates Inc., 2020

work page 2020
[9]

Episodic multi-agent reinforcement learning with curiosity- driven exploration,

L. Zheng, J. Chen, J. Wang, J. He, Y . Hu, Y . Chen, C. Fan, Y . Gao, and C. Zhang, “Episodic multi-agent reinforcement learning with curiosity- driven exploration,” in Proceedings of the 35th International Conference on Neural Information Processing Systems , ser. NIPS ’21. Red Hook, NY , USA: Curran Associates Inc., 2021

work page 2021
[10]

Emu: Efficient episodic memory utilization of cooperative multi-agent reinforcement learning,

A. Authors, “Emu: Efficient episodic memory utilization of cooperative multi-agent reinforcement learning,” in International Conference on Learning Representations (ICLR) , 2024, oral Presentation. [Online]. Available: https://iclr.cc/virtual/2024/oral/19766

work page 2024
[12]

A survey of reinforcement learning algorithms for dynamically varying environments,

S. Padakandla, “A survey of reinforcement learning algorithms for dynamically varying environments,” ACM Comput. Surv., vol. 54, no. 6, Jul. 2021. [Online]. Available: https://doi.org/10.1145/3459991

work page doi:10.1145/3459991 2021
[13]

A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity

P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, “A survey of learning in multiagent environments: Dealing with non- stationarity,” 2019. [Online]. Available: https://arxiv.org/abs/1707.09183

work page internal anchor Pith review Pith/arXiv arXiv 2019
[14]

The StarCraft Multi-Agent Challenge,

M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C.-M. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson, “The starcraft multi-agent challenge,” 2019. [Online]. Available: https://arxiv.org/abs/1902.04043 JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 15

work page arXiv 2019
[15]

SMACv2: An improved benchmark for cooperative multi-agent reinforcement learning,

B. Ellis, J. Cook, S. Moalla, M. Samvelyan, M. Sun, A. Mahajan, J. N. Foerster, and S. Whiteson, “SMACv2: An improved benchmark for cooperative multi-agent reinforcement learning,” in Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , vol. 36, 2023, pp. 37 567–37 593. [Online]. Available: https://openreview....

work page 2023
[16]

A survey on curriculum learning,

X. Wang, Y . Chen, and W. Zhu, “A survey on curriculum learning,”IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 44, no. 9, pp. 4555–4576, 2022

work page 2022
[17]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Counterfactual multi-agent policy gradients,

J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” ser. AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018

work page 2018
[19]

Monotonic value function factorisation for deep multi- agent reinforcement learning,

T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi- agent reinforcement learning,” J. Mach. Learn. Res. , vol. 21, no. 1, Jan. 2020

work page 2020
[20]

Discriminative experience replay for efficient multi- agent reinforcement learning,

H. Xunhan, Z. Jian, Z. Wengang, F. Ruili, and L. Houqiang, “Discriminative experience replay for efficient multi- agent reinforcement learning,” DeepAI, 2023. [Online]. Avail- able: https://deepai.org/publication/discriminative-experience-replay-/ for-efficient-multi-agent-reinforcement-learning

work page 2023
[21]

Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning

G. Papoudakis, F. Christianos, A. Rahman, and S. V . Albrecht, “Dealing with non-stationarity in multi-agent deep reinforcement learning,” 2019. [Online]. Available: https://arxiv.org/abs/1906.04737

work page internal anchor Pith review Pith/arXiv arXiv 2019
[22]

Dealing with non- stationarity in MARL via trust-region decomposition,

W. Li, X. Wang, B. Jin, J. Sheng, and H. Zha, “Dealing with non- stationarity in MARL via trust-region decomposition,” in International Conference on Learning Representations , 2022. [Online]. Available: https://openreview.net/forum?id=XHUxf5aRB3s

work page 2022
[23]

Tackling non-stationarity in decentralized multi-agent reinforcement learning with prudent q- learning,

J. Wei, L. Wang, X. Tao, H. Hu, and H. Wu, “Tackling non-stationarity in decentralized multi-agent reinforcement learning with prudent q- learning,” in Web Information Systems and Applications , X. Zhao, S. Yang, X. Wang, and J. Li, Eds. Cham: Springer International Publishing, 2022, pp. 403–415

work page 2022
[24]

Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning,

H. Nekoei, A. Badrinaaraayanan, A. Sinha, M. Amini, J. Rajendran, A. Mahajan, and S. Chandar, “Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning,” in Proceedings of The 2nd Conference on Lifelong Learning Agents , ser. Proceedings of Machine Learning Research, S. Chandar, R. Pas...

work page 2023
[25]

Monotonic improvement guarantees under non- stationarity for decentralized PPO,

M. Sun, S. Devlin, J. A. Beck, K. Hofmann, and S. Whiteson, “Monotonic improvement guarantees under non- stationarity for decentralized PPO,” 2022. [Online]. Available: https://openreview.net/forum?id=uHv20yi8saL

work page 2022
[26]

Value-decomposition networks for cooperative multi-agent learning based on team reward,

P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V . Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Proceedings of the 17th Interna- tional Conference on Autonomous Agents and MultiAgent Systems , ser. AAMAS ’18. Richland...

work page 2018
[27]

QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning

K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y . Yi, “Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” 2019. [Online]. Available: https: //arxiv.org/abs/1905.05408

work page internal anchor Pith review Pith/arXiv arXiv 2019
[28]

Team-wise effective communication in multi- agent reinforcement learning,

M. Yang, K. Zhao, Y . Wang, R. Dong, Y . Du, F. Liu, M. Zhou, and L. H. U, “Team-wise effective communication in multi- agent reinforcement learning,” Autonomous Agents and Multi-Agent Systems, vol. 38, no. 2, p. 36, 2024. [Online]. Available: https: //doi.org/10.1007/s10458-024-09665-6

work page doi:10.1007/s10458-024-09665-6 2024
[29]

Scalable communication for multi- agent reinforcement learning via transformer-based email mechanism,

X. Guo, D. Shi, and W. Fan, “Scalable communication for multi- agent reinforcement learning via transformer-based email mechanism,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, ser. IJCAI ’23, 2023

work page 2023
[30]

Ac2c: Adaptively controlled two-hop communication for multi-agent reinforcement learning,

X. Wang, X. Li, J. Shao, and J. Zhang, “Ac2c: Adaptively controlled two-hop communication for multi-agent reinforcement learning,” in Pro- ceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, ser. AAMAS ’23. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2023, p. 427–435

work page 2023
[31]

Communication in multi-agent reinforcement learning: Intention sharing,

W. Kim, J. Park, and Y . Sung, “Communication in multi-agent reinforcement learning: Intention sharing,” in International Conference on Learning Representations , 2021. [Online]. Available: https: //openreview.net/forum?id=qpsl2dR9twy

work page 2021
[32]

A sequen- tial multi-agent reinforcement learning framework for different action spaces,

S. Tian, M. Yang, R. Xiong, X. He, and S. Rajasegarar, “A sequen- tial multi-agent reinforcement learning framework for different action spaces,” Expert Systems with Applications , vol. 258, p. 125138, 2024

work page 2024
[33]

Addressing high-dimensional continuous action space via decomposed discrete policy-critic,

Y . Zhang, J. Sun, G. Wang, and J. Chen, “Addressing high-dimensional continuous action space via decomposed discrete policy-critic,” 2023. [Online]. Available: https://openreview.net/forum?id=blCpfjAeFkn

work page 2023
[34]

D- marl: A dynamic communication-based action space enhancement for multi agent reinforcement learning exploration of large scale unknown environments,

G. Calzolari, V . Sumathy, C. Kanellakis, and G. Nikolakopoulos, “D- marl: A dynamic communication-based action space enhancement for multi agent reinforcement learning exploration of large scale unknown environments,” in 2024 IEEE/RSJ International Conference on Intelli- gent Robots and Systems (IROS) , 2024, pp. 3470–3475

work page 2024
[35]

Cooperative modular reinforcement learning for large discrete action space problem,

F. Ming, F. Gao, K. Liu, and C. Zhao, “Cooperative modular reinforcement learning for large discrete action space problem,” Neural Networks , vol. 161, pp. 281–296, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608023000588

work page 2023
[36]

Exploration in deep reinforcement learning: From single-agent to multiagent domain,

J. Hao, T. Yang, H. Tang, C. Bai, J. Liu, Z. Meng, P. Liu, and Z. Wang, “Exploration in deep reinforcement learning: From single-agent to multiagent domain,” IEEE Transactions on Neural Networks and Learning Systems , vol. 35, no. 7, p. 8762–8782, Jul. 2024. [Online]. Available: http://dx.doi.org/10.1109/TNNLS.2023.3236361

work page doi:10.1109/tnnls.2023.3236361 2024
[37]

Rethinking exploration and experience exploitation in value-based multi-agent reinforcement learn- ing,

A. Borzilov, A. Skrynnik, and A. Panov, “Rethinking exploration and experience exploitation in value-based multi-agent reinforcement learn- ing,” in International Conference on Computational Optimization, 2024. [Online]. Available: https://openreview.net/forum?id=RzoxFLA966

[38] J. Z. Leibo, E. A. Dueñez-Guzman, A. Vezhnevets, J. P. Agapiou, P. Sunehag, R. Koster, J. Matyas, C. Beattie, I. Mordatch, and T. Graepel, “Scalable evaluation of multi-agent reinforcement learning with melting pot,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zh...

[39] Y. Yang, G. Chen, J. Hao, and P.-A. Heng, “Sample-efficient multiagent reinforcement learning with reset replay,” in Forty-first International Conference on Machine Learning, 2024. [Online]. Available: https://openreview.net/forum?id=w8ei1o9U5y

[40] J. Shao, Z. Lou, H. Zhang, Y. Jiang, S. He, and X. Ji, “Self-organized group for cooperative multi-agent reinforcement learning,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022.

[41] K. Kurach, A. Raichuk, P. Stanczyk, M. Zajac, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, and S. Gelly, “Google research football: A novel reinforcement learning environment,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 4501–4510, 2020. [Online]. Available: https://doi.org/10.1609/aaa... (doi:10.1609/aaai.v34i04.5878)

[42] C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, and J. Bruna, “Pommerman: A multi-agent playground,” 2022. [Online]. Available: https://arxiv.org/abs/1809.07124

[43] J. Suarez, Y. Du, C. Zhu, I. Mordatch, and P. Isola, “The neural mmo platform for massively multiagent research,” 2021. [Online]. Available: https://arxiv.org/abs/2110.07594

[44] G. Papoudakis, F. Christianos, L. Schäfer, and S. V. Albrecht, “Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks,” 2021. [Online]. Available: https://arxiv.org/abs/2006.07869

[45] Z. Ning and L. Xie, “A survey on multi-agent reinforcement learning and its application,” Journal of Automation and Intelligence, vol. 3, no. 2, pp. 73–91, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2949855424000042

[46] W. Jin, H. Du, B. Zhao, X. Tian, B. Shi, and G. Yang, “A comprehensive survey on multi-agent cooperative decision-making: Scenarios, approaches, challenges and perspectives,” 2025. [Online]. Available: https://arxiv.org/abs/2503.13415

[47] F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs, ser. SpringerBriefs in Intelligent Systems. Springer Cham, 2016, vol. 1, no. 134, 14 b/w illustrations, 22 colour illustrations. [Online]. Available: https://doi.org/10.1007/978-3-319-28929-8

[48] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in Autonomous Agents and Multiagent Systems, G. Sukthankar and J. A. Rodriguez-Aguilar, Eds. Cham: Springer International Publishing, 2017, pp. 66–83.

[49] K. R. Chandra and S. Borugadda, “Multi agent deep reinforcement learning with deep q-network based energy efficiency and resource allocation in noma wireless systems,” in 2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT), 202...

[50] A. M. Hafiz and G. M. Bhat, “Deep q-network based multi-agent reinforcement learning with binary action agents,” 2020. [Online]. Available: https://arxiv.org/abs/2008.04109

[51] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” in Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence. AAAI Press, 1998, pp. 746–752.

[52] T. Ikeda and T. Shibuya, “Centralized training with decentralized execution reinforcement learning for cooperative multi-agent systems with communication delay,” in 2022 61st Annual Conference of the Society of Instrument and Control Engineers (SICE), 2022, pp. 135–140.

[53] C. Amato, “An introduction to centralized training for decentralized execution in cooperative multi-agent reinforcement learning,” 2024. [Online]. Available: https://arxiv.org/abs/2409.03052

[54] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022.

[55] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” [Online]. Available: https://arxiv.org/abs/1706.02275

[57] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 41–48. ISBN 9781605585161. [Online]. Available: https://doi.org/10.1145/1553374.1553380

[58] M. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., vol. 23. Curran Associates, Inc., 2010. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2010/file/e57c6b956a6521b28495f2886ca0977a-Paper.pdf

[60] P. Sun, X. Sun, L. Han, J. Xiong, Q. Wang, B. Li, Y. Zheng, J. Liu, Y. Liu, H. Liu, and T. Zhang, “TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game,” 2018. [Online]. Available: https://arxiv.org/abs/1809.07193

[61] J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang, “Qplex: Duplex dueling multi-agent q-learning,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=Rcmk0xxIQV

Weiqiang Jin (Student Member, IEEE) is currently a Ph.D. candidate in Electronic and Information Engineering at Xi’an Jiaotong Univ...
