Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage
Pith reviewed 2026-05-19 11:05 UTC · model grok-4.3
The pith
Adaptive curriculum driven by win-rate signals and counterfactual group advantages overcomes fixed-difficulty limits in multi-agent reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that static-difficulty training in MARL creates what it calls environmental meta-stationarity, which caps generalization and steers learning toward shallow optima. Its proposed remedy, CL-MARL, adapts opponent strength online from win-rate signals through the FlexDiff scheduler, which fuses momentum-based trend estimation with sliding-window dual-curve monitoring. CL-MARL also introduces the Counterfactual Group Relative Policy Advantage (CGRPA) to disentangle agent contributions under shifting dynamics.
What carries the argument
The FlexDiff scheduler for adaptive difficulty transitions based on win-rate signals together with the Counterfactual Group Relative Policy Advantage (CGRPA) for optimization in non-stationary team settings.
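To make the scheduling idea concrete, here is a minimal sketch of a win-rate-driven difficulty scheduler. It is not the paper's FlexDiff: the class name `WinRateScheduler`, the thresholds `up` and `down`, the number of levels, and the choice to feed binary win outcomes into both monitored curves are all assumptions made for illustration. It only shows how a momentum estimate and two sliding windows might be combined before a level change is allowed.

```python
from collections import deque


class WinRateScheduler:
    """Hypothetical win-rate-driven scheduler (an illustration, not FlexDiff)."""

    def __init__(self, levels=5, window=20, beta=0.9, up=0.7, down=0.3):
        self.levels = levels            # number of discrete difficulty levels (assumed)
        self.level = 0                  # start at the easiest level
        self.beta = beta                # momentum coefficient for the smoothed win rate
        self.up, self.down = up, down   # assumed promotion / demotion thresholds
        self.momentum = 0.5             # smoothed training win rate, neutral start
        self.train = deque(maxlen=window)  # sliding window of training outcomes
        self.evals = deque(maxlen=window)  # sliding window of evaluation outcomes

    def update(self, train_win, eval_win):
        """Record one training and one evaluation outcome (0 or 1); return the level."""
        self.train.append(train_win)
        self.evals.append(eval_win)
        self.momentum = self.beta * self.momentum + (1 - self.beta) * train_win
        if len(self.train) < self.train.maxlen:
            return self.level           # wait until the window is full
        train_rate = sum(self.train) / len(self.train)
        eval_rate = sum(self.evals) / len(self.evals)
        # Change level only when both curves and the momentum estimate agree.
        if min(train_rate, eval_rate, self.momentum) >= self.up:
            self._move(+1)
        elif max(train_rate, eval_rate, self.momentum) <= self.down:
            self._move(-1)
        return self.level

    def _move(self, step):
        new_level = min(self.levels - 1, max(0, self.level + step))
        if new_level != self.level:
            self.level = new_level
            self.train.clear()          # restart monitoring at the new difficulty
            self.evals.clear()
            self.momentum = 0.5
```

Requiring agreement between both windows and the momentum term is one plausible way to damp noise from sparse binary outcomes; whether the paper's scheduler uses this rule is not stated in the material above.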
If this is right
- CL-MARL achieves a 40% mean win rate on super-hard SMAC maps with an average episode return of 17.85.
- It outperforms QMIX, OW-QMIX, DER, EMC, and MARR baselines by an average of +2.94.
- Peak win rates are reached approximately 1.28 times faster on the 8m_vs_9m map and 1.42 times faster on the 3s5z_vs_3s6z map compared to the strongest baseline.
Where Pith is reading between the lines
- This adaptive approach could be tested in other MARL environments where difficulty can be parameterized to see if similar gains in generalization occur.
- The reliance on win rates suggests that alternative performance metrics might be needed for tasks without binary win/loss outcomes.
- Extending the counterfactual mechanism might help address non-stationarity in single-agent RL with changing environments as well.
Load-bearing premise
Win-rate signals provide a sufficiently clean and timely indicator of agent mastery that can be used to drive stable difficulty transitions without introducing excessive noise or requiring per-map manual tuning.
What would settle it
Running the method on the super-hard SMAC maps and finding that the difficulty level oscillates frequently without leading to improved final win rates over fixed-difficulty training would indicate that the adaptive curriculum does not reliably overcome meta-stationarity.
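One way to quantify "oscillates frequently" in such a test is to count direction reversals in the logged difficulty trace. The sketch below assumes the trace is available as a list of integer levels, one per scheduler decision; the function name `count_reversals` and that logging format are hypothetical.

```python
def count_reversals(levels):
    """Count how often a difficulty trace changes direction (up then down, or down then up)."""
    reversals, last = 0, 0
    for prev, cur in zip(levels, levels[1:]):
        step = (cur > prev) - (cur < prev)   # +1 for a rise, -1 for a drop, 0 if unchanged
        if step and last and step != last:
            reversals += 1
        if step:
            last = step
    return reversals
```

A high reversal count paired with no gain in final win rate over fixed-difficulty training would be the outcome described above.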
Original abstract
Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28faster on 8m_vs_9m and 1.42 faster on 3s5z_vs_3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.
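The abstract describes CGRPA only in words, so the following is a sketch of how a counterfactual baseline and a group-relative normalization could be composed, not the paper's estimator. It assumes a COMA-style baseline (the expected team value over one agent's alternative actions, teammates held fixed) and treats the team's agents as the group; both function names and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def counterfactual_advantage(q_values, policy, taken):
    """COMA-style advantage for one agent.

    q_values[a] is the team value if this agent had played action a while
    teammates kept their actions; policy[a] is the agent's probability of a.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[taken] - baseline


def group_relative(advantages, eps=1e-8):
    """GRPO-style normalization: center and scale values within the group."""
    mu, sigma = mean(advantages), pstdev(advantages)
    return [(a - mu) / (sigma + eps) for a in advantages]


# Two agents, two actions each; all numbers are invented for illustration.
raw = [
    counterfactual_advantage([1.0, 3.0], [0.5, 0.5], taken=1),
    counterfactual_advantage([2.0, 2.5], [0.2, 0.8], taken=0),
]
print(group_relative(raw))
```

The referee's request for an explicit equation would settle whether the group is formed over agents, over rollouts, or over both.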
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that fixed-difficulty training in MARL induces environmental meta-stationarity, which limits generalization and leads to shallow optima. It proposes CL-MARL, which uses the FlexDiff scheduler to dynamically adjust opponent difficulty from win-rate and return signals via momentum estimation and dual-curve sliding-window monitoring, together with the Counterfactual Group Relative Policy Advantage (CGRPA) estimator to mitigate the resulting non-stationarity. On SMAC super-hard maps the method is reported to reach a 40% mean win rate and average return of 17.85, outperforming QMIX, OW-QMIX, DER, EMC and MARR by +2.94 on average while attaining peak performance 1.28–1.42 times faster on two maps.
Significance. If the reported gains prove robust, the work would be significant for practical MARL by showing that online curriculum adaptation can improve both final performance and sample efficiency on cooperative tasks against scripted opponents. The public implementation strengthens reproducibility. The combination of a scheduler driven by sparse binary signals and a counterfactual group advantage estimator addresses a concrete training pathology, though the magnitude of improvement remains modest relative to the added complexity.
major comments (2)
- [Abstract / §4] Abstract and experimental results: the concrete performance claims (40% win rate, +2.94 average improvement, 1.28× and 1.42× speed-ups) are presented without any indication of run count, standard deviation, confidence intervals or statistical tests. This information is load-bearing for the central empirical claim that CL-MARL exceeds the listed baselines.
- [§3.2] §3.2 (FlexDiff scheduler): the claim that momentum-based trend estimation plus dual-curve monitoring yields “stable difficulty transitions without manual tuning” is central to overcoming meta-stationarity, yet no variance plots, sensitivity analysis to window size, or failure-case statistics are supplied for maps whose mean win rate is only ~40%. On such maps individual episodes produce sparse binary signals, raising the risk that the scheduler oscillates or stalls exactly as the skeptic note anticipates.
minor comments (2)
- [Abstract] Abstract: “1.28faster” is missing a space and is inconsistent with the subsequent “1.42 faster”; standardize phrasing.
- [§3.3] Notation for CGRPA: the extension of GRPO with counterfactual baselines would benefit from an explicit equation defining the advantage estimator before the empirical section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the empirical support and analysis in the manuscript.
Point-by-point responses
- Referee: [Abstract / §4] Abstract and experimental results: the concrete performance claims (40% win rate, +2.94 average improvement, 1.28× and 1.42× speed-ups) are presented without any indication of run count, standard deviation, confidence intervals or statistical tests. This information is load-bearing for the central empirical claim that CL-MARL exceeds the listed baselines.
Authors: We agree that statistical rigor is necessary to substantiate the reported performance gains. In the revised manuscript we will explicitly state that all results are averaged over 5 independent random seeds, report standard deviations alongside the mean win rates and returns, include 95% confidence intervals, and add paired t-test p-values comparing CL-MARL against each baseline. These details will be added to the abstract, Section 4, and the corresponding tables and figures.
Revision: yes
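The statistics promised in this response can be computed with the standard library alone. The sketch below assumes 5 seeds (hence 4 degrees of freedom) and that method and baseline share the same seeds, which is what makes the test paired. The per-seed numbers are placeholders invented for illustration and are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF4 = 2.776  # two-sided 95% Student-t critical value for 4 degrees of freedom


def summarize(values, t_crit=T_CRIT_DF4):
    """Return mean, sample standard deviation, and 95% CI half-width."""
    m, s = mean(values), stdev(values)
    return m, s, t_crit * s / sqrt(len(values))


def paired_t(a, b):
    """Paired t statistic for the per-seed differences a[i] - b[i]."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))


# Placeholder per-seed win rates, invented for illustration only.
method = [0.42, 0.38, 0.41, 0.37, 0.42]
baseline = [0.35, 0.36, 0.33, 0.34, 0.37]
m, s, half = summarize(method)
t = paired_t(method, baseline)
print(f"mean={m:.3f} sd={s:.3f} 95% CI=+/-{half:.3f} paired t={t:.2f}")
```

With 5 seeds, a paired t statistic whose absolute value exceeds 2.776 is significant at the 5% level (two-sided). With so few seeds the interval is wide, so the half-width is worth reporting next to every mean.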
- Referee: [§3.2] §3.2 (FlexDiff scheduler): the claim that momentum-based trend estimation plus dual-curve monitoring yields “stable difficulty transitions without manual tuning” is central to overcoming meta-stationarity, yet no variance plots, sensitivity analysis to window size, or failure-case statistics are supplied for maps whose mean win rate is only ~40%. On such maps individual episodes produce sparse binary signals, raising the risk that the scheduler oscillates or stalls exactly as the skeptic note anticipates.
Authors: We acknowledge that the current manuscript lacks explicit stability diagnostics for FlexDiff under sparse win-rate signals. In the revision we will add (i) variance plots of the estimated difficulty level and momentum trend over training for the super-hard maps, (ii) a sensitivity table varying the sliding-window sizes and momentum coefficient, and (iii) a short discussion of any observed oscillations or stalls together with the conditions under which they were mitigated. These additions will directly address the concern about robustness on maps with ~40% mean win rate.
Revision: yes
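A window-size sensitivity check of the kind proposed in (ii) can be prototyped before touching the real training loop. The toy below is an assumption-laden stand-in: it feeds Bernoulli outcomes with a fixed 40% win probability into a plain threshold rule (not FlexDiff), keeps that probability constant regardless of level so that only sampling noise drives the switches, and uses invented thresholds.

```python
import random
from collections import deque


def count_switches(window, episodes=2000, p_win=0.4, up=0.6, down=0.2, seed=0):
    """Count level changes when a toy threshold rule sees Bernoulli(p_win) outcomes."""
    rng = random.Random(seed)
    recent = deque(maxlen=window)
    switches = 0
    for _ in range(episodes):
        recent.append(1 if rng.random() < p_win else 0)
        if len(recent) < window:
            continue                      # wait for a full window
        rate = sum(recent) / window
        if rate >= up or rate <= down:    # toy promotion / demotion rule
            switches += 1
            recent.clear()                # restart monitoring after a change
    return switches


for w in (5, 20, 80):
    print(w, count_switches(w))
```

Because the underlying win probability never changes here, every switch is spurious. Short windows should therefore produce many more switches than long ones, which is the failure mode the referee raises for maps near 40% win rate.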
Circularity Check
No significant circularity; claims rest on empirical outcomes rather than self-referential derivations
The paper defines environmental meta-stationarity as a new term for fixed-difficulty training, then proposes CL-MARL with FlexDiff (momentum + dual-curve win-rate monitoring) and CGRPA (counterfactual group-relative advantage). These are algorithmic constructions whose performance is reported as measured results on SMAC super-hard maps (40% mean win rate, +2.94 average return over baselines). No equation or scheduler step is shown to equal its own input by construction, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to force the central result. The reported speed-ups and win-rate gains are presented as experimental outcomes of the scheduler and estimator, not as identities derived from the inputs themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Win rate serves as a reliable proxy for mastery that can drive difficulty changes without destabilizing learning.
invented entities (2)
- FlexDiff scheduler: no independent evidence
- CGRPA: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: FlexDiff fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: embed_injective (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: CGRPA constructs a counterfactual advantage function that isolates individual contributions within group behavior
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.