arxiv: 2506.07548 · v2 · submitted 2025-06-09 · 💻 cs.AI · cs.RO

Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage

Pith reviewed 2026-05-19 11:05 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords multi-agent reinforcement learning · curriculum learning · adaptive difficulty · counterfactual advantage · StarCraft Multi-Agent Challenge · environmental meta-stationarity · group relative policy optimization

The pith

Adaptive curriculum driven by win-rate signals and counterfactual group advantages overcomes fixed-difficulty limits in multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that training multi-agent reinforcement learning agents against opponents at a single fixed difficulty creates environmental meta-stationarity, which restricts how well policies generalize and pushes learning into shallow local optima. To address this, the authors introduce CL-MARL, a framework that dynamically adjusts the difficulty of the task online based on the agents' win rates. This is paired with a new optimization method called Counterfactual Group Relative Policy Advantage that helps separate each agent's contribution despite the changing team dynamics. If correct, this would allow agents to build skills more progressively and achieve better performance on challenging cooperative tasks like those in StarCraft.

Core claim

The paper claims that static-difficulty training in MARL creates environmental meta-stationarity, which caps generalization and steers learning toward shallow optima. It proposes to overcome this with CL-MARL, which adapts opponent strength online from win-rate signals through the FlexDiff scheduler (momentum-based trend estimation fused with sliding-window dual-curve monitoring), and with the Counterfactual Group Relative Policy Advantage, which disentangles agent contributions under shifting team dynamics.

What carries the argument

The FlexDiff scheduler for adaptive difficulty transitions based on win-rate signals together with the Counterfactual Group Relative Policy Advantage (CGRPA) for optimization in non-stationary team settings.
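
To make the scheduling idea concrete, here is a minimal Python sketch of a win-rate-driven difficulty scheduler. It is not the paper's FlexDiff: the class name `WinRateScheduler`, the thresholds `up` and `down`, the window length, the momentum coefficient `beta`, and the rule that both curves must agree before a transition are all assumptions made for illustration. It also tracks binary win outcomes on both curves, whereas the abstract describes monitoring training and evaluation returns.

```python
from collections import deque


class WinRateScheduler:
    """Illustrative win-rate-driven difficulty scheduler (NOT the paper's FlexDiff)."""

    def __init__(self, levels=5, window=20, beta=0.9, up=0.7, down=0.3):
        self.level = 0                      # current difficulty index (hypothetical scale)
        self.max_level = levels - 1
        self.beta = beta                    # momentum coefficient of the moving average
        self.ema = None                     # smoothed training win rate
        self.train = deque(maxlen=window)   # sliding window of training outcomes
        self.evals = deque(maxlen=window)   # sliding window of evaluation outcomes
        self.up = up                        # promote when both curves reach this rate
        self.down = down                    # demote when both curves fall to this rate

    def update(self, train_win, eval_win):
        """Record one training and one evaluation outcome (1 = win, 0 = loss)."""
        self.train.append(float(train_win))
        self.evals.append(float(eval_win))
        train_rate = sum(self.train) / len(self.train)
        if self.ema is None:
            self.ema = train_rate
        else:
            self.ema = self.beta * self.ema + (1 - self.beta) * train_rate
        if len(self.train) < self.train.maxlen:
            return self.level               # wait until the window is full
        eval_rate = sum(self.evals) / len(self.evals)
        if min(self.ema, eval_rate) >= self.up and self.level < self.max_level:
            self.level += 1
            self._reset()
        elif max(self.ema, eval_rate) <= self.down and self.level > 0:
            self.level -= 1
            self._reset()
        return self.level

    def _reset(self):
        """Drop statistics gathered at the previous difficulty level."""
        self.train.clear()
        self.evals.clear()
        self.ema = None
```

Requiring both the smoothed training curve and the evaluation window to cross a threshold is one simple way to damp oscillation. Clearing the windows after each transition keeps statistics from the old difficulty from triggering an immediate second move.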

If this is right

  • CL-MARL achieves a 40% mean win rate on super-hard SMAC maps with an average episode return of 17.85.
  • It outperforms QMIX, OW-QMIX, DER, EMC, and MARR baselines by an average of +2.94.
  • Peak win rates are reached approximately 1.28 times faster on the 8m_vs_9m map and 1.42 times faster on the 3s5z_vs_3s6z map compared to the strongest baseline.
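
The page does not say how the "times faster" figures are computed. One plausible reading, sketched below as an assumption, is the ratio of training steps the baseline and the method each need to reach the baseline's own peak win rate. The function names and the choice of target are hypothetical.

```python
def steps_to_reach(steps, win_rates, target):
    """Return the first training step at which the win rate reaches `target`, else None."""
    for step, rate in zip(steps, win_rates):
        if rate >= target:
            return step
    return None


def speedup(steps, method_curve, baseline_curve):
    """Baseline steps divided by method steps needed to hit the baseline's peak.

    Assumed definition for illustration; the paper may measure speed differently.
    """
    target = max(baseline_curve)
    t_method = steps_to_reach(steps, method_curve, target)
    t_base = steps_to_reach(steps, baseline_curve, target)
    if t_method is None or t_base is None:
        return None
    return t_base / t_method
```

Under this reading, a value of 1.28 would mean the baseline needs 28% more steps than the method to reach the same win rate.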

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This adaptive approach could be tested in other MARL environments where difficulty can be parameterized to see if similar gains in generalization occur.
  • The reliance on win rates suggests that alternative performance metrics might be needed for tasks without binary win/loss outcomes.
  • Extending the counterfactual mechanism might help address non-stationarity in single-agent RL with changing environments as well.

Load-bearing premise

Win-rate signals provide a sufficiently clean and timely indicator of agent mastery that can be used to drive stable difficulty transitions without introducing excessive noise or requiring per-map manual tuning.

What would settle it

Running the method on the super-hard SMAC maps and finding that the difficulty level oscillates frequently without leading to improved final win rates over fixed-difficulty training would indicate that the adaptive curriculum does not reliably overcome meta-stationarity.
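
A test of this kind needs a way to quantify "oscillates frequently". The sketch below counts direction reversals in a logged difficulty trace; the function name and the reversal-count metric are assumptions for illustration, not a diagnostic taken from the paper.

```python
def count_reversals(levels):
    """Count direction reversals (up then down, or down then up) in a difficulty trace."""
    reversals = 0
    last_dir = 0  # +1 for the last move up, -1 for the last move down, 0 before any move
    for prev, cur in zip(levels, levels[1:]):
        direction = (cur > prev) - (cur < prev)
        if direction == 0:
            continue  # difficulty unchanged at this step
        if last_dir != 0 and direction != last_dir:
            reversals += 1
        last_dir = direction
    return reversals
```

A monotone trace such as `[0, 1, 1, 2, 3]` has no reversals, while a trace that flips between two levels accumulates one reversal per flip after the first move. Comparing this count against the final win rate, across window sizes, would be one way to probe the load-bearing premise.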

Figures

Figures reproduced from arXiv: 2506.07548 by Biao Zhao, Hongyang Du, Jinhu Qi, Junli Wang, Shixiang Tang, Weiqiang Jin, Wentao Zhang, Yang Liu.

Figure 1. When CL Meets MARL: A comparative illustration of training paradigms between CL-based supervised learning and CL-based MARL. Part 1 and Part 2 represent the traditional CL framework for supervised learning (e.g., CV, NLP) with labeled data and its optimization trajectories (supervision loss minimization), and Part 3 and Part 4 are the CL-based MARL training framework in the SMAC adversarial environment (2s…
Figure 2. The architecture of our proposed CL-based MARL framework for cooperative-adversarial scenarios in SMAC. The …
Figure 3. The detailed visualization of our proposed CGRPA …
Figure 4. The typical SMAC benchmark scenarios and maps …
Figure 5. Baseline Comparisons of the six easy maps in the SMAC benchmark. The x-axis represents the global training steps, …
Figure 6. Baseline Comparisons of the eight maps of …
Figure 7. Baseline Comparisons of the eight maps of …
Figure 8. Performance Comparison of Convergence Risks In …
Figure 9. Comparative Analysis on Normalized Win Rates of …
Figure 11-12. It was evident that, under the extended simulation …
Figure 11. Example of the Remaining Survival Status of Units …
Figure 12. Example of the Remaining Survival Status of Units in …
Abstract

Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28faster on 8m_vs_9m and 1.42 faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that fixed-difficulty training in MARL induces environmental meta-stationarity, which limits generalization and leads to shallow optima. It proposes CL-MARL, which uses the FlexDiff scheduler to dynamically adjust opponent difficulty from win-rate and return signals via momentum estimation and dual-curve sliding-window monitoring, together with the Counterfactual Group Relative Policy Advantage (CGRPA) estimator to mitigate the resulting non-stationarity. On SMAC super-hard maps the method is reported to reach a 40% mean win rate and average return of 17.85, outperforming QMIX, OW-QMIX, DER, EMC and MARR by +2.94 on average while attaining peak performance 1.28–1.42 times faster on two maps.

Significance. If the reported gains prove robust, the work would be significant for practical MARL by showing that online curriculum adaptation can improve both final performance and sample efficiency on cooperative tasks against scripted opponents. The public implementation strengthens reproducibility. The combination of a scheduler driven by sparse binary signals and a counterfactual group advantage estimator addresses a concrete training pathology, though the magnitude of improvement remains modest relative to the added complexity.

major comments (2)
  1. [Abstract / §4] Abstract and experimental results: the concrete performance claims (40% win rate, +2.94 average improvement, 1.28× and 1.42× speed-ups) are presented without any indication of run count, standard deviation, confidence intervals or statistical tests. This information is load-bearing for the central empirical claim that CL-MARL exceeds the listed baselines.
  2. [§3.2] §3.2 (FlexDiff scheduler): the claim that momentum-based trend estimation plus dual-curve monitoring yields “stable difficulty transitions without manual tuning” is central to overcoming meta-stationarity, yet no variance plots, sensitivity analysis to window size, or failure-case statistics are supplied for maps whose mean win rate is only ~40%. On such maps individual episodes produce sparse binary signals, raising the risk that the scheduler oscillates or stalls exactly as the skeptic note anticipates.
minor comments (2)
  1. [Abstract] Abstract: “1.28faster” is missing a space and is inconsistent with the subsequent “1.42 faster”; standardize phrasing.
  2. [§3.3] Notation for CGRPA: the extension of GRPO with counterfactual baselines would benefit from an explicit equation defining the advantage estimator before the empirical section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the empirical support and analysis in the manuscript.

Point-by-point responses
  1. Referee: [Abstract / §4] Abstract and experimental results: the concrete performance claims (40% win rate, +2.94 average improvement, 1.28× and 1.42× speed-ups) are presented without any indication of run count, standard deviation, confidence intervals or statistical tests. This information is load-bearing for the central empirical claim that CL-MARL exceeds the listed baselines.

    Authors: We agree that statistical rigor is necessary to substantiate the reported performance gains. In the revised manuscript we will explicitly state that all results are averaged over 5 independent random seeds, report standard deviations alongside the mean win rates and returns, include 95% confidence intervals, and add paired t-test p-values comparing CL-MARL against each baseline. These details will be added to the abstract, Section 4, and the corresponding tables and figures.

    Revision: yes

  2. Referee: [§3.2] §3.2 (FlexDiff scheduler): the claim that momentum-based trend estimation plus dual-curve monitoring yields “stable difficulty transitions without manual tuning” is central to overcoming meta-stationarity, yet no variance plots, sensitivity analysis to window size, or failure-case statistics are supplied for maps whose mean win rate is only ~40%. On such maps individual episodes produce sparse binary signals, raising the risk that the scheduler oscillates or stalls exactly as the skeptic note anticipates.

    Authors: We acknowledge that the current manuscript lacks explicit stability diagnostics for FlexDiff under sparse win-rate signals. In the revision we will add (i) variance plots of the estimated difficulty level and momentum trend over training for the super-hard maps, (ii) a sensitivity table varying the sliding-window sizes and momentum coefficient, and (iii) a short discussion of any observed oscillations or stalls together with the conditions under which they were mitigated. These additions will directly address the concern about robustness on maps with ~40% mean win rate.

    Revision: yes
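
The statistics promised in the first response (mean, standard deviation, 95% confidence interval, and a paired t-test over 5 seeds) can be sketched with the Python standard library. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, the critical value is hard-coded for exactly 5 seeds, and the numbers in the usage lines are arbitrary placeholders rather than results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 seeds).
T_CRIT_95_DF4 = 2.776


def summarize(values, t_crit=T_CRIT_95_DF4):
    """Return (mean, sample std, 95% confidence interval) for per-seed scores."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    half_width = t_crit * std / math.sqrt(len(values))
    return mean, std, (mean - half_width, mean + half_width)


def paired_t(a, b):
    """Paired t statistic for two methods evaluated on the same seeds."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Placeholder per-seed scores, chosen only to exercise the functions.
method_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
baseline_scores = [0.5, 1.0, 2.5, 3.0, 4.5]
mean, std, ci = summarize(method_scores)
t_stat = paired_t(method_scores, baseline_scores)
```

With 5 seeds the paired test has 4 degrees of freedom, so the t statistic must exceed 2.776 in absolute value for a two-sided p-value below 0.05. A library routine such as `scipy.stats.ttest_rel` returns the p-value directly.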

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical outcomes rather than self-referential derivations

full rationale

The paper defines environmental meta-stationarity as a new term for fixed-difficulty training, then proposes CL-MARL with FlexDiff (momentum + dual-curve win-rate monitoring) and CGRPA (counterfactual group-relative advantage). These are algorithmic constructions whose performance is reported as measured results on SMAC super-hard maps (40% mean win rate, +2.94 average return over baselines). No equation or scheduler step is shown to equal its own input by construction, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to force the central result. The reported speed-ups and win-rate gains are presented as experimental outcomes of the scheduler and estimator, not as identities derived from the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The method rests on standard MARL assumptions plus two new algorithmic pieces whose internal parameters and stability properties are not detailed in the abstract.

axioms (1)
  • Domain assumption: Win rate serves as a reliable proxy for mastery that can drive difficulty changes without destabilizing learning.
    Invoked to justify the FlexDiff scheduler decisions.
invented entities (2)
  • FlexDiff scheduler (no independent evidence)
    Purpose: Online adaptation of opponent strength from win-rate signals.
    Fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns.
  • CGRPA (no independent evidence)
    Purpose: Disentangle individual agent contributions under non-stationary team dynamics.
    Extends GRPO-style group-relative optimization with counterfactual baselines.
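
For orientation, the two published ingredients that CGRPA is described as combining can be written down. The block below gives the standard group-relative advantage from GRPO (as introduced in DeepSeekMath, under outcome supervision) and the counterfactual advantage from COMA (Foerster et al., 2018). How the paper merges them is not stated on this page, so no combined CGRPA formula is given here. The notation ($G$ sampled outputs with rewards $r_i$; agent $a$ with action $u^a$, action-observation history $\tau^a$, and joint action $\mathbf{u}$) follows those sources and may differ from the paper's.

```latex
% GRPO: advantage of sample i relative to its group of G samples
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}
                 {\operatorname{std}(\{r_1, \dots, r_G\})}

% COMA: counterfactual advantage of agent a, marginalizing out its own action
A^{a}(s, \mathbf{u}) = Q(s, \mathbf{u})
  - \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right)
    Q\!\left(s, (\mathbf{u}^{-a}, u'^{a})\right)
```

The first normalizes a reward against its sampled group, which removes the need for a learned value baseline. The second holds the other agents' actions $\mathbf{u}^{-a}$ fixed and asks how much agent $a$'s chosen action improved on its own policy average, which is the sense in which it isolates an individual contribution.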

pith-pipeline@v0.9.0 · 5837 in / 1448 out tokens · 46264 ms · 2026-05-19T11:05:18.504955+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 5 internal anchors

  1. [1]

    A survey on multi-agent reinforcement learning and its application,

    Z. Ning and L. Xie, “A survey on multi-agent reinforcement learning and its application,” Journal of Automation and Intelligence , vol. 3, no. 2, pp. 73–91, 2024

  2. [3]

    A comprehensive survey on multi-agent reinforcement learning for connected and automated vehicles,

    P. Yadav, A. Mishra, and S. Kim, “A comprehensive survey on multi-agent reinforcement learning for connected and automated vehicles,” Sensors, vol. 23, no. 10, 2023. [Online]. Available: https://www.mdpi.com/1424-8220/23/10/4710

  3. [4]

    Marlens: Understanding multi-agent reinforcement learning for traffic signal control via visual analytics,

    Y . Zhang, G. Zheng, Z. Liu, Q. Li, and H. Zeng, “Marlens: Understanding multi-agent reinforcement learning for traffic signal control via visual analytics,” IEEE Transactions on Visualization and Computer Graphics , p. 1–16, 2024. [Online]. Available: http: //dx.doi.org/10.1109/TVCG.2024.3392587

  4. [5]

    Multi-Agent Reinforcement Learning for Autonomous Driving: A Survey,

    R. Zhang, J. Hou, F. Walter, S. Gu, J. Guan, F. R ¨ohrbein, Y . Du, P. Cai, G. Chen, and A. Knoll, “Multi-agent reinforcement learning for autonomous driving: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2408.09675

  5. [6]

    A survey of multi-agent deep reinforcement learning with communication,

    C. Zhu, M. Dastani, and S. Wang, “A survey of multi-agent deep reinforcement learning with communication,” Autonomous Agents and Multi-Agent Systems , vol. 38, no. 1, p. 4, 2024. [Online]. Available: https://doi.org/10.1007/s10458-023-09633-6

  6. [7]

    Deep reinforcement learning: A survey,

    X. Wang, S. Wang, X. Liang, D. Zhao, J. Huang, X. Xu, B. Dai, and Q. Miao, “Deep reinforcement learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems , vol. 35, no. 4, pp. 5064– 5078, 2024

  7. [8]

    Weighted qmix: expanding monotonic value function factorisation for deep multi-agent reinforcement learning,

    T. Rashid, G. Farquhar, B. Peng, and S. Whiteson, “Weighted qmix: expanding monotonic value function factorisation for deep multi-agent reinforcement learning,” in Proceedings of the 34th International Con- ference on Neural Information Processing Systems , ser. NIPS ’20. Red Hook, NY , USA: Curran Associates Inc., 2020

  8. [9]

    Episodic multi-agent reinforcement learning with curiosity- driven exploration,

    L. Zheng, J. Chen, J. Wang, J. He, Y . Hu, Y . Chen, C. Fan, Y . Gao, and C. Zhang, “Episodic multi-agent reinforcement learning with curiosity- driven exploration,” in Proceedings of the 35th International Conference on Neural Information Processing Systems , ser. NIPS ’21. Red Hook, NY , USA: Curran Associates Inc., 2021

  9. [10]

    Emu: Efficient episodic memory utilization of cooperative multi-agent reinforcement learning,

    A. Authors, “Emu: Efficient episodic memory utilization of cooperative multi-agent reinforcement learning,” in International Conference on Learning Representations (ICLR) , 2024, oral Presentation. [Online]. Available: https://iclr.cc/virtual/2024/oral/19766

  10. [12]

    A survey of reinforcement learning algorithms for dynamically varying environments,

    S. Padakandla, “A survey of reinforcement learning algorithms for dynamically varying environments,” ACM Comput. Surv., vol. 54, no. 6, Jul. 2021. [Online]. Available: https://doi.org/10.1145/3459991

  11. [13]

    A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity

    P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, “A survey of learning in multiagent environments: Dealing with non- stationarity,” 2019. [Online]. Available: https://arxiv.org/abs/1707.09183

  12. [14]

    The StarCraft Multi-Agent Challenge,

    M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C.-M. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson, “The starcraft multi-agent challenge,” 2019. [Online]. Available: https://arxiv.org/abs/1902.04043 JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 15

  13. [15]

    SMACv2: An improved benchmark for cooperative multi-agent reinforcement learning,

    B. Ellis, J. Cook, S. Moalla, M. Samvelyan, M. Sun, A. Mahajan, J. N. Foerster, and S. Whiteson, “SMACv2: An improved benchmark for cooperative multi-agent reinforcement learning,” in Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , vol. 36, 2023, pp. 37 567–37 593. [Online]. Available: https://openreview....

  14. [16]

    A survey on curriculum learning,

    X. Wang, Y . Chen, and W. Zhu, “A survey on curriculum learning,”IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 44, no. 9, pp. 4555–4576, 2022

  15. [17]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300

  16. [18]

    Counterfactual multi-agent policy gradients,

    J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” ser. AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018

  17. [19]

    Monotonic value function factorisation for deep multi- agent reinforcement learning,

    T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi- agent reinforcement learning,” J. Mach. Learn. Res. , vol. 21, no. 1, Jan. 2020

  18. [20]

    Discriminative experience replay for efficient multi- agent reinforcement learning,

    H. Xunhan, Z. Jian, Z. Wengang, F. Ruili, and L. Houqiang, “Discriminative experience replay for efficient multi- agent reinforcement learning,” DeepAI, 2023. [Online]. Avail- able: https://deepai.org/publication/discriminative-experience-replay-/ for-efficient-multi-agent-reinforcement-learning

  19. [21]

    Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning

    G. Papoudakis, F. Christianos, A. Rahman, and S. V . Albrecht, “Dealing with non-stationarity in multi-agent deep reinforcement learning,” 2019. [Online]. Available: https://arxiv.org/abs/1906.04737

  20. [22]

    Dealing with non- stationarity in MARL via trust-region decomposition,

    W. Li, X. Wang, B. Jin, J. Sheng, and H. Zha, “Dealing with non- stationarity in MARL via trust-region decomposition,” in International Conference on Learning Representations , 2022. [Online]. Available: https://openreview.net/forum?id=XHUxf5aRB3s

  21. [23]

    Tackling non-stationarity in decentralized multi-agent reinforcement learning with prudent q- learning,

    J. Wei, L. Wang, X. Tao, H. Hu, and H. Wu, “Tackling non-stationarity in decentralized multi-agent reinforcement learning with prudent q- learning,” in Web Information Systems and Applications , X. Zhao, S. Yang, X. Wang, and J. Li, Eds. Cham: Springer International Publishing, 2022, pp. 403–415

  22. [24]

    Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning,

    H. Nekoei, A. Badrinaaraayanan, A. Sinha, M. Amini, J. Rajendran, A. Mahajan, and S. Chandar, “Dealing with non-stationarity in decentralized cooperative multi-agent deep reinforcement learning via multi-timescale learning,” in Proceedings of The 2nd Conference on Lifelong Learning Agents , ser. Proceedings of Machine Learning Research, S. Chandar, R. Pas...

  23. [25]

    Monotonic improvement guarantees under non- stationarity for decentralized PPO,

    M. Sun, S. Devlin, J. A. Beck, K. Hofmann, and S. Whiteson, “Monotonic improvement guarantees under non- stationarity for decentralized PPO,” 2022. [Online]. Available: https://openreview.net/forum?id=uHv20yi8saL

  24. [26]

    Value-decomposition networks for cooperative multi-agent learning based on team reward,

    P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V . Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Proceedings of the 17th Interna- tional Conference on Autonomous Agents and MultiAgent Systems , ser. AAMAS ’18. Richland...

  25. [27]

    QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning

    K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y . Yi, “Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” 2019. [Online]. Available: https: //arxiv.org/abs/1905.05408

  26. [28]

    Team-wise effective communication in multi- agent reinforcement learning,

    M. Yang, K. Zhao, Y . Wang, R. Dong, Y . Du, F. Liu, M. Zhou, and L. H. U, “Team-wise effective communication in multi- agent reinforcement learning,” Autonomous Agents and Multi-Agent Systems, vol. 38, no. 2, p. 36, 2024. [Online]. Available: https: //doi.org/10.1007/s10458-024-09665-6

  27. [29]

    Scalable communication for multi- agent reinforcement learning via transformer-based email mechanism,

    X. Guo, D. Shi, and W. Fan, “Scalable communication for multi- agent reinforcement learning via transformer-based email mechanism,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, ser. IJCAI ’23, 2023

  28. [30]

    Ac2c: Adaptively controlled two-hop communication for multi-agent reinforcement learning,

    X. Wang, X. Li, J. Shao, and J. Zhang, “Ac2c: Adaptively controlled two-hop communication for multi-agent reinforcement learning,” in Pro- ceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, ser. AAMAS ’23. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2023, p. 427–435

  29. [31]

    Communication in multi-agent reinforcement learning: Intention sharing,

    W. Kim, J. Park, and Y . Sung, “Communication in multi-agent reinforcement learning: Intention sharing,” in International Conference on Learning Representations , 2021. [Online]. Available: https: //openreview.net/forum?id=qpsl2dR9twy

  30. [32]

    A sequen- tial multi-agent reinforcement learning framework for different action spaces,

    S. Tian, M. Yang, R. Xiong, X. He, and S. Rajasegarar, “A sequen- tial multi-agent reinforcement learning framework for different action spaces,” Expert Systems with Applications , vol. 258, p. 125138, 2024

  31. [33]

    Addressing high-dimensional continuous action space via decomposed discrete policy-critic,

    Y . Zhang, J. Sun, G. Wang, and J. Chen, “Addressing high-dimensional continuous action space via decomposed discrete policy-critic,” 2023. [Online]. Available: https://openreview.net/forum?id=blCpfjAeFkn

  32. [34]

    D- marl: A dynamic communication-based action space enhancement for multi agent reinforcement learning exploration of large scale unknown environments,

    G. Calzolari, V . Sumathy, C. Kanellakis, and G. Nikolakopoulos, “D- marl: A dynamic communication-based action space enhancement for multi agent reinforcement learning exploration of large scale unknown environments,” in 2024 IEEE/RSJ International Conference on Intelli- gent Robots and Systems (IROS) , 2024, pp. 3470–3475

  33. [35]

    Cooperative modular reinforcement learning for large discrete action space problem,

    F. Ming, F. Gao, K. Liu, and C. Zhao, “Cooperative modular reinforcement learning for large discrete action space problem,” Neural Networks , vol. 161, pp. 281–296, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608023000588

  34. [36]

    Exploration in deep reinforcement learning: From single-agent to multiagent domain,

    J. Hao, T. Yang, H. Tang, C. Bai, J. Liu, Z. Meng, P. Liu, and Z. Wang, “Exploration in deep reinforcement learning: From single-agent to multiagent domain,” IEEE Transactions on Neural Networks and Learning Systems , vol. 35, no. 7, p. 8762–8782, Jul. 2024. [Online]. Available: http://dx.doi.org/10.1109/TNNLS.2023.3236361

  35. [37]

    Rethinking exploration and experience exploitation in value-based multi-agent reinforcement learn- ing,

    A. Borzilov, A. Skrynnik, and A. Panov, “Rethinking exploration and experience exploitation in value-based multi-agent reinforcement learn- ing,” in International Conference on Computational Optimization, 2024. [Online]. Available: https://openreview.net/forum?id=RzoxFLA966

  36. [38]

    Scalable evaluation of multi-agent reinforcement learning with melting pot,

    J. Z. Leibo, E. A. Due ˜nez-Guzman, A. Vezhnevets, J. P. Agapiou, P. Sunehag, R. Koster, J. Matyas, C. Beattie, I. Mordatch, and T. Graepel, “Scalable evaluation of multi-agent reinforcement learning with melting pot,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zh...

  37. [39]

    Sample-efficient multiagent reinforcement learning with reset replay,

    Y . Yang, G. Chen, J. HAO, and P.-A. Heng, “Sample-efficient multiagent reinforcement learning with reset replay,” in Forty-first International Conference on Machine Learning , 2024. [Online]. Available: https://openreview.net/forum?id=w8ei1o9U5y

  38. [40]

    Self- organized group for cooperative multi-agent reinforcement learning,

    J. Shao, Z. Lou, H. Zhang, Y . Jiang, S. He, and X. Ji, “Self- organized group for cooperative multi-agent reinforcement learning,” in Proceedings of the 36th International Conference on Neural Information Processing Systems , ser. NIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022

  39. [41]

    Google research football: A novel reinforcement learning environment,

    K. Kurach, A. Raichuk, P. Stanczyk, M. Zajac, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, and S. Gelly, “Google research football: A novel reinforcement learning environment,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 4501–4510, 2020. [Online]. Available: https://doi.org/10.1609/aaa...

  40. [42]

    Pommerman: A multi-agent playground,

    C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, and J. Bruna, “Pommerman: A multi-agent playground,” 2022. [Online]. Available: https://arxiv.org/abs/1809.07124

  41. [43]

    The neural mmo platform for massively multiagent research,

    J. Suarez, Y . Du, C. Zhu, I. Mordatch, and P. Isola, “The neural mmo platform for massively multiagent research,” 2021. [Online]. Available: https://arxiv.org/abs/2110.07594

  42. [44]

    Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks,

    G. Papoudakis, F. Christianos, L. Sch ¨afer, and S. V . Albrecht, “Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks,” 2021. [Online]. Available: https://arxiv.org/abs/2006. 07869

  43. [45]

    A survey on multi-agent reinforcement learning and its application,

    Z. Ning and L. Xie, “A survey on multi-agent reinforcement learning and its application,” Journal of Automation and Intelligence , vol. 3, no. 2, pp. 73–91, 2024. [Online]. Available: https://www.sciencedirect. com/science/article/pii/S2949855424000042

  44. [46]

    A comprehensive survey on multi-agent cooperative decision-making: Scenarios, approaches, challenges and perspectives

    W. Jin, H. Du, B. Zhao, X. Tian, B. Shi, and G. Yang, “A comprehensive survey on multi-agent cooperative decision-making: Scenarios, approaches, challenges and perspectives,” 2025. [Online]. Available: https://arxiv.org/abs/2503.13415

  45. [47]

    F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs, ser. SpringerBriefs in Intelligent Systems. Springer Cham, 2016, vol. 1, no. 134, 14 b/w illustrations, 22 colour illustrations. [Online]. Available: https://doi.org/10.1007/978-3-319-28929-8

  46. [48]

    Cooperative multi-agent control using deep reinforcement learning,

    J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in Autonomous Agents and Multiagent Systems, G. Sukthankar and J. A. Rodriguez-Aguilar, Eds. Cham: Springer International Publishing, 2017, pp. 66–83

  47. [49]

    Multi agent deep reinforcement learning with deep q-network based energy efficiency and resource allocation in noma wireless systems,

    K. R. Chandra and S. Borugadda, “Multi agent deep reinforcement learning with deep q-network based energy efficiency and resource allocation in noma wireless systems,” in 2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT), 202...

  48. [50]

    Deep q-network based multi-agent reinforcement learning with binary action agents,

    A. M. Hafiz and G. M. Bhat, “Deep q-network based multi-agent reinforcement learning with binary action agents,” 2020. [Online]. Available: https://arxiv.org/abs/2008.04109

  49. [51]

    The dynamics of reinforcement learning in cooperative multiagent systems,

    C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” in Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence. AAAI Press, 1998, pp. 746–752

  50. [52]

    Centralized training with decentralized execution reinforcement learning for cooperative multi-agent systems with communication delay,

    T. Ikeda and T. Shibuya, “Centralized training with decentralized execution reinforcement learning for cooperative multi-agent systems with communication delay,” in 2022 61st Annual Conference of the Society of Instrument and Control Engineers (SICE), 2022, pp. 135–140

  51. [53]

    An introduction to centralized training for decentralized execution in cooperative multi-agent reinforcement learning,

    C. Amato, “An introduction to centralized training for decentralized execution in cooperative multi-agent reinforcement learning,” 2024. [Online]. Available: https://arxiv.org/abs/2409.03052

  52. [54]

    The surprising effectiveness of ppo in cooperative multi-agent games,

    C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022

  53. [55]

    Multi-agent actor-critic for mixed cooperative-competitive environments,

    R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” [Online]. Available: https://arxiv.org/abs/1706.02275

  55. [57]

    Curriculum learning,

    Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 41–48. [Online]. Available: https://doi.org/10.1145/1553374.1553380

  56. [58]

    Self-paced learning for latent variable models,

    M. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., vol. 23. Curran Associates, Inc., [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2010/file/e57c6b956a6521b28495f2886ca0977a-Paper.pdf

  58. [60]

    TStarBots: Defeating the Cheating Level Builtin AI in StarCraft II in the Full Game

    P. Sun, X. Sun, L. Han, J. Xiong, Q. Wang, B. Li, Y. Zheng, J. Liu, Y. Liu, H. Liu, and T. Zhang, “Tstarbots: Defeating the cheating level builtin ai in starcraft ii in the full game,” 2018. [Online]. Available: https://arxiv.org/abs/1809.07193

  59. [61]

    Qplex: Duplex dueling multi-agent q-learning,

    J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang, “Qplex: Duplex dueling multi-agent q-learning,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=Rcmk0xxIQV

Weiqiang Jin (Student Member, IEEE) is currently a Ph.D. candidate in Electronic and Information Engineering at Xi’an Jiaotong Univ...