Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies

Josiah P. Hanna; Nicholas E. Corrado

arxiv: 2508.01049 · v2 · pith:ZLMAXRMFnew · submitted 2025-08-01 · 💻 cs.LG

Pith reviewed 2026-05-19 01:00 UTC · model grok-4.3

keywords: multi-agent reinforcement learning, policy gradients, sampling error, adaptive sampling, centralized training decentralized execution, cooperative sampling, independent policies, joint action distribution

The pith

CoSER reduces joint sampling error more efficiently than independent sampling, increasing the reliability of independent policy gradient algorithms in multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Independent on-policy policy gradient algorithms for multi-agent reinforcement learning can converge sub-optimally due to sampling error even when individual expected gradients point to the optimum. Finite trajectories from independent action sampling cause the joint data distribution to deviate from the expected joint on-policy distribution, producing inaccurate gradient estimates. The paper introduces CoSER, which uses an adaptive centralized behavior policy to increase sampling probability on under-sampled joint actions. This coordinated selection reduces joint sampling error more efficiently than standard methods and improves the chance that agents converge to an optimal joint policy. Readers should care because it addresses a previously under-recognized source of unreliability in a widely used class of MARL algorithms.

Core claim

Stochasticity in independent action sampling after collecting finite trajectories causes the joint data distribution to deviate from the expected joint on-policy distribution, leading to inaccurate gradient estimates and sub-optimal convergence. CoSER continually adapts a centralized behavior policy to place higher probability on joint actions that are under-sampled with respect to the current joint policy. This reduces joint sampling error and increases the reliability of independent policy gradient algorithms, defined as the probability of converging to an optimal joint policy.
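The deviation described above can be made concrete with a small simulation. The following is a minimal sketch, assuming two agents with two actions each and uniform policies; the helper names (`joint_policy`, `empirical_joint`, `kl_divergence`) and the choice of KL divergence as the error measure are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def joint_policy(pi_1, pi_2):
    """Expected joint on-policy distribution: product of the independent policies."""
    return np.outer(pi_1, pi_2)

def empirical_joint(actions_1, actions_2, n1, n2):
    """Empirical joint action frequencies from paired action samples."""
    counts = np.zeros((n1, n2))
    np.add.at(counts, (actions_1, actions_2), 1)
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q), skipping entries where p is zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
pi_1 = np.array([0.5, 0.5])  # hypothetical policy for Agent 1
pi_2 = np.array([0.5, 0.5])  # hypothetical policy for Agent 2
target = joint_policy(pi_1, pi_2)

for m in (8, 64, 512):
    # Each agent samples its own action independently of the other.
    a1 = rng.choice(2, size=m, p=pi_1)
    a2 = rng.choice(2, size=m, p=pi_2)
    emp = empirical_joint(a1, a2, 2, 2)
    print(m, kl_divergence(emp, target))
```

With small batches the empirical joint frequencies can sit noticeably away from the product distribution, which is the finite-sample mismatch the core claim refers to.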

What carries the argument

CoSER (Cooperative Sampling Error Reduction), which adapts a centralized behavior policy during data collection to prioritize under-sampled joint actions in the Centralized Training with Decentralized Execution setting.
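The idea of placing higher probability on under-sampled joint actions can be illustrated with a tabular stand-in. The sketch below is not CoSER's learned behavior policy update: it swaps in a simple count-based rule (sampling probability proportional to each joint action's positive count deficit), and all names (`adaptive_behavior_probs`, `collect`) and policy values are hypothetical.

```python
import numpy as np

def adaptive_behavior_probs(target, counts):
    """Behavior distribution that favors joint actions sampled less often than expected."""
    n = counts.sum()
    # Expected count after the next draw minus the observed count so far.
    deficit = (n + 1) * target - counts
    weights = np.clip(deficit, 0.0, None)  # over-sampled actions get zero probability
    return weights / weights.sum()         # deficits sum to 1, so the total is positive

def collect(target, m, rng, adaptive):
    """Draw m joint actions and return the empirical joint distribution."""
    flat_target = target.ravel()
    counts = np.zeros_like(flat_target)
    for _ in range(m):
        probs = adaptive_behavior_probs(flat_target, counts) if adaptive else flat_target
        a = rng.choice(flat_target.size, p=probs)
        counts[a] += 1
    return (counts / m).reshape(target.shape)

# Hypothetical independent policies for two agents.
target = np.outer([0.7, 0.3], [0.4, 0.6])
m = 64
independent = collect(target, m, np.random.default_rng(1), adaptive=False)
coordinated = collect(target, m, np.random.default_rng(1), adaptive=True)
print("max abs error, independent:", np.abs(independent - target).max())
print("max abs error, coordinated:", np.abs(coordinated - target).max())
```

Under this rule a joint action is only drawn while its count is below its expected count, so each empirical frequency stays within (k - 1)/m of its target probability for k joint actions. The paper's method instead adapts a parameterized centralized policy, so this sketch only conveys the intuition.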

Load-bearing premise

A centralized behavior policy can be trained and used to select joint actions without introducing new biases or violating the decentralized execution constraint in a way that invalidates the on-policy gradient estimates.

What would settle it

An experiment showing that CoSER does not reduce measured joint sampling error or does not increase the fraction of runs that reach optimal joint policies compared to independent sampling.

Figures

Figures reproduced from arXiv: 2508.01049 by Josiah P. Hanna, Nicholas E. Corrado.

**Figure 1.** 2 × 2 matrix game where (r1, r2) denotes the rewards for Agent 1 (r1) and Agent 2 (r2). Under on-policy sampling, the only way to reduce sampling error is to collect more data. Recently, Corrado and Hanna [10] introduced an action sampling algorithm (PROPS) that reduces sampling error more efficiently than standard on-policy sampling. However, this work focused on reducing sampling error w.r.t. a single policy in sing…

**Figure 2.** MA-PROPS overview. Rather than collecting data …

**Figure 3.** A subset of games used in our experiments. All …

**Figure 4.** Sampling error curves over 10 seeds. Takeaway: MA-PROPS reduces joint sampling error faster than PROPS and on-policy sampling even though PROPS often reduces sampling error w.r.t. Agent 1 faster than MA-PROPS. […] provides only marginal benefit in LBF, no improvement in BoulderPush, and a 20-point drop in GridWorld.

**Figure 5.** MAPPO training curves over 100 seeds. Takeaway: MA-PROPS increases success rate 10-20% over on-policy sampling and PROPS. PROPS increases success rate in one task (LBF). Panels: (a) BoulderPush, (b) LBF.

**Figure 6.** Sampling error w.r.t. the joint target policy over 100 seeds.

**Figure 8.** Original and modified 3 × 3 matrix games. Footnote 5: We again omit the behavior policy's dependence on θ for clarity, as θ is fixed during behavior updates.

**Figure 9.** All structurally distinct 2 × 2 no-conflict matrix games from Section 11.2.1 of Albrecht et al. [52]. Each cell shows the reward pair (r1, r2) for Agents 1 and 2. In all games, the optimal outcome is (A, A). To ensure the true policy gradient w.r.t. uniformly random policies increases the probability of the optimal outcome, we change the reward associated with the optimal outcome to (4, 5) in games 7-12 an…

**Figure 10.** Joint sampling error during training in GridWorld, Climbing, and Penalty games. Solid …

**Figure 11.** MAPPO success rate during training for all structurally distinct …

**Figure 12.** IPPO success rate during training for all structurally distinct …

**Figure 13.** Joint sampling error for MAPPO + PROPS training on all structurally distinct …

**Figure 14.** Joint sampling error for MAPPO + PROPS training on all structurally distinct …

**Figure 15.** IPPO success rate during training for GridWorld, BoulderPush, and LBF. Solid curves …

Original abstract

Independent on-policy policy gradient algorithms are widely used for multi-agent reinforcement learning (MARL) in cooperative and no-conflict games, but they are known to converge sub-optimally when each agent's individual policy gradient points away from an optimal joint equilibrium. Going beyond prior work, we observe that sub-optimal convergence can still arise even when the expected individual policy gradients of each agent point toward the optimal joint solution. After collecting a finite set of trajectories, stochasticity in independent action sampling can cause the joint data distribution to deviate from the expected joint on-policy distribution. This *sampling error* w.r.t. the joint on-policy distribution produces inaccurate gradient estimates that can make agents converge sub-optimally. We hypothesize that joint sampling error can be reduced through coordinated action selection and that doing so will increase the reliability of policy gradient learning in MARL (i.e., the probability of converging to an optimal joint policy). To test this hypothesis, we first introduce an adaptive action sampling approach to reduce joint sampling error in the Centralized Training with Decentralized Execution setting. Our method, Cooperative Sampling Error Reduction (CoSER), continually adapts a centralized behavior policy to place higher probability on joint actions that are under-sampled w.r.t. the current joint policy. We then empirically evaluate CoSER on a diverse set of multi-agent games and demonstrate that (1) CoSER reduces joint sampling error more efficiently than independent on-policy sampling and (2) this reduction increases the reliability of independent policy gradient algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CoSER targets joint sampling error in independent MARL via adaptive centralized behavior policy, but the on-policy gradient validity hinges on an unmentioned correction.

The core observation is that independent sampling in MARL can produce joint action distributions that deviate from the product of the agents' current policies, even when expected individual gradients point toward the joint optimum. This finite-sample mismatch creates inaccurate updates and hurts convergence reliability. CoSER counters it by training a centralized behavior policy that continually up-weights under-sampled joint actions relative to the evolving joint policy, then uses those samples for data collection in the CTDE setting.

Referee Report

2 major / 2 minor

Summary. The paper claims that independent on-policy policy gradient methods in multi-agent RL can converge sub-optimally due to finite-sample joint sampling error (deviation of realized joint action distribution from the expected product of individual policies), even when expected individual gradients point toward the joint optimum. It introduces CoSER, an adaptive centralized behavior policy in the CTDE setting that up-weights under-sampled joint actions relative to the current joint policy, and reports empirical results showing reduced joint sampling error and higher reliability (probability of reaching optimal joint policies) compared to independent sampling across diverse games.

Significance. If the central claim holds after addressing potential biases in the gradient estimates, the result would be significant for MARL: it isolates joint sampling error as a distinct, fixable source of unreliability in independent PG methods and demonstrates a practical centralized sampling fix that preserves decentralized execution. This could improve training stability in cooperative settings without requiring fully centralized policies or value functions.

major comments (2)

[Abstract and §3 (CoSER description)] Abstract and CoSER method description: the paper states that joint actions are sampled from an adapted centralized behavior policy rather than from the product of the agents' independent policies, yet provides no importance-sampling correction (e.g., ratios ∏ π_i(a_i) / π_b(a)) or other unbiased estimator for the on-policy gradients. Because the data distribution differs from the joint on-policy distribution, standard REINFORCE-style updates applied directly to these trajectories are biased; this is load-bearing for the claim that observed reliability gains are attributable to reduced sampling error rather than to biased updates.
[Empirical evaluation (results tables/figures)] Empirical evaluation sections: the manuscript reports that CoSER increases reliability but does not include error bars, statistical significance tests across random seeds, or controls for confounding factors such as the additional compute or model capacity required to train and query the centralized behavior policy. Without these, it is difficult to assess whether the reliability gains are robust or merely reflect extra resources.
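The first major comment can be illustrated by exact enumeration on a 2 × 2 game. This is a minimal sketch of the referee's ratio ∏ π_i(a_i) / π_b(a) applied to a REINFORCE-style term for Agent 1 with softmax logits; the policies, the behavior distribution, and the reward table are hypothetical values chosen for illustration, and the sketch makes no claim about which estimator the paper actually uses.

```python
import numpy as np

pi_1 = np.array([0.6, 0.4])                 # hypothetical policy for Agent 1
pi_2 = np.array([0.5, 0.5])                 # hypothetical policy for Agent 2
target = np.outer(pi_1, pi_2)               # product of the independent policies
behavior = np.array([[0.1, 0.4],
                     [0.4, 0.1]])           # hypothetical centralized behavior policy
reward = np.array([[1.0, 0.0],
                   [0.0, 2.0]])             # hypothetical shared reward

def expected_grad_agent1(sampling_dist, weights):
    """Exact expectation of w(a) * r(a) * grad log pi_1(a_1) under sampling_dist."""
    grad = np.zeros(2)
    for a1 in range(2):
        score = np.eye(2)[a1] - pi_1        # score function for softmax logits
        for a2 in range(2):
            grad += sampling_dist[a1, a2] * weights[a1, a2] * reward[a1, a2] * score
    return grad

ones = np.ones((2, 2))
is_weights = target / behavior              # prod_i pi_i(a_i) / pi_b(a)

on_policy = expected_grad_agent1(target, ones)
uncorrected = expected_grad_agent1(behavior, ones)
corrected = expected_grad_agent1(behavior, is_weights)
print("on-policy:  ", on_policy)
print("uncorrected:", uncorrected)
print("corrected:  ", corrected)
```

For a fixed behavior distribution that differs from the product of the agents' policies, the uncorrected expectation differs from the on-policy one, while the importance-weighted expectation matches it. The correction requires the behavior policy to assign positive probability to every joint action the joint policy can take.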

minor comments (2)

[§2 or §3] Notation for the joint policy and behavior policy should be introduced explicitly with equations early in the method section to clarify how the centralized sampler relates to the product of individual policies.
[Abstract and introduction] The abstract and introduction would benefit from a brief statement of the precise on-policy estimator used (e.g., REINFORCE, actor-critic) and whether any off-policy correction is applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we have made or plan to make in the updated version.

Referee: [Abstract and §3 (CoSER description)] Abstract and CoSER method description: the paper states that joint actions are sampled from an adapted centralized behavior policy rather than from the product of the agents' independent policies, yet provides no importance-sampling correction (e.g., ratios ∏ π_i(a_i) / π_b(a)) or other unbiased estimator for the on-policy gradients. Because the data distribution differs from the joint on-policy distribution, standard REINFORCE-style updates applied directly to these trajectories are biased; this is load-bearing for the claim that observed reliability gains are attributable to reduced sampling error rather than to biased updates.

Authors: We appreciate the referee pointing out this critical aspect of the method. The CoSER behavior policy is indeed distinct from the product of individual policies to reduce joint sampling error. To maintain unbiased estimates of the on-policy gradients, we agree that an importance sampling correction is required. In the revised manuscript, we have incorporated the importance sampling ratio ∏_i π_i(a_i) / π_b(a) into the gradient estimator. This ensures that the updates remain unbiased with respect to the joint on-policy distribution while still benefiting from the adaptive sampling that reduces variance in the realized joint distribution. We have updated the method description in Section 3 and added a formal derivation in the appendix demonstrating unbiasedness of the corrected estimator.

revision: yes

Referee: [Empirical evaluation (results tables/figures)] Empirical evaluation sections: the manuscript reports that CoSER increases reliability but does not include error bars, statistical significance tests across random seeds, or controls for confounding factors such as the additional compute or model capacity required to train and query the centralized behavior policy. Without these, it is difficult to assess whether the reliability gains are robust or merely reflect extra resources.

Authors: We acknowledge the need for more rigorous statistical analysis in the empirical sections. We have revised the evaluation to include error bars showing the standard error across 20 independent random seeds for all reported reliability metrics. Additionally, we now report results of Wilcoxon signed-rank tests to assess statistical significance of the improvements over baselines. To address potential confounding due to extra compute or capacity, we have included a new set of experiments where we allocate equivalent additional resources to the baseline methods (e.g., by increasing the size of individual policy networks). The results show that the gains from CoSER persist even under these controls. These updates are reflected in the revised Figures 3-5 and a new subsection in Section 5.

revision: yes
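An error bar on a reliability figure of the kind discussed above can be computed as follows. This is a minimal sketch, assuming reliability is measured as the fraction of seeds that reach an optimal joint policy and that seeds are independent; the function name and the outcome list are placeholders for illustration, not results from the paper.

```python
import math

def reliability_with_standard_error(successes):
    """Success fraction across seeds and its binomial standard error."""
    n = len(successes)
    p = sum(successes) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Placeholder outcomes (1 = run reached an optimal joint policy), NOT real data.
outcomes = [1, 1, 0, 1]
p, se = reliability_with_standard_error(outcomes)
print(f"reliability = {p:.2f} +/- {se:.3f}")
```

A paired significance test across seeds would be a separate step and is not shown here.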

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on external validation

The paper defines joint sampling error as the deviation between finite-sample joint action frequencies and the product of independent policies, then constructs CoSER as an adaptive centralized behavior policy that up-weights under-sampled joints. The central claims—that CoSER reduces this error more efficiently than independent sampling and thereby increases policy-gradient reliability—are supported by empirical measurements on multiple games rather than by any equation that equates the output to the input by construction. No self-citation chains, fitted parameters renamed as predictions, or uniqueness theorems imported from prior author work appear as load-bearing steps. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that joint sampling error is the dominant source of sub-optimal convergence and that centralized adaptation can be performed without altering the decentralized execution semantics.

pith-pipeline@v0.9.0 · 5800 in / 1139 out tokens · 30358 ms · 2026-05-19T01:00:00.384619+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Lean files: Cost/FunctionalEquation.lean; Foundation/RealityFromDistinction.lean
Theorems: washburn_uniqueness_aczel; reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: MA-PROPS samples actions from a separate data collection policy and periodically updates this policy to increase the probability of under-sampled joint actions... D_KL(π_D(·|s) ∥ π(·|s)) = O_p(1/m²) under MA-PROPS
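The on-policy half of the rate quoted in this passage, D_KL = O_p(1/m) for ordinary sampling, can be sanity-checked numerically. The sketch below is an illustration under assumed policy values and helper names (`kl_divergence`, `mean_scaled_kl`); it only simulates i.i.d. on-policy sampling and does not test the O_p(1/m²) claim for MA-PROPS.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q), skipping entries where p is zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mean_scaled_kl(target, m, trials, rng):
    """Average of m * KL(empirical || target) over batches of m i.i.d. draws."""
    flat = target.ravel()
    total = 0.0
    for _ in range(trials):
        counts = rng.multinomial(m, flat)
        total += m * kl_divergence(counts / m, flat)
    return total / trials

rng = np.random.default_rng(0)
target = np.outer([0.7, 0.3], [0.4, 0.6])  # hypothetical joint policy, 4 joint actions
scaled = {m: mean_scaled_kl(target, m, trials=200, rng=rng) for m in (100, 1000, 10000)}
for m, value in scaled.items():
    print(m, round(value, 2))
```

If the KL divergence shrinks like 1/m, the scaled values should stay roughly constant as m grows rather than growing or vanishing.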

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 5 internal anchors

[1] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998.
[2] Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402, 2021.
[4] Filippos Christianos, Georgios Papoudakis, and Stefano V Albrecht. Pareto actor-critic for equilibrium selection in multi-agent reinforcement learning. arXiv preprint arXiv:2209.14344, 2022.
[5] Christian Schroeder De Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.
[6] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022.
[7] Ming Zhou, Jun Luo, Julian Villella, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, et al. Smarts: An open-source scalable multi-agent rl training school for autonomous driving. In Conference on Robot Learning, pages 264–285. PMLR, 2021.
[8] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[9] Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V Albrecht. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. arXiv preprint arXiv:2006.07869, 2020.
[10] Nicholas E Corrado and Josiah P Hanna. On-policy policy gradient reinforcement learning without on-policy sampling. arXiv preprint arXiv:2311.08290, 2023.
[11] Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano Albrecht, and Josiah Hanna. Robust on-policy sampling for data-efficient policy evaluation in reinforcement learning. Advances in Neural Information Processing Systems, 35:37376–37388, 2022.
[12] Subhojyoti Mukherjee, Josiah P Hanna, and Robert D Nowak. Revar: Strengthening policy evaluation via reduced variance sampling. In Uncertainty in Artificial Intelligence, pages 1413–1422. PMLR, 2022.
[13] Aaron David Tucker and Thorsten Joachims. Variance-optimal augmentation logging for counterfactual evaluation in contextual bandits. arXiv preprint arXiv:2202.01721, 2022.
[14] Runzhe Wan, Branislav Kveton, and Rui Song. Safe exploration for efficient policy evaluation and comparison. In International Conference on Machine Learning, pages 22491–22511. PMLR, 2022.
[15] Ksenia Konyushova, Yutian Chen, Thomas Paine, Caglar Gulcehre, Cosmin Paduraru, Daniel J Mankowitz, Misha Denil, and Nando de Freitas. Active offline policy selection. Advances in Neural Information Processing Systems, 34:24631–24644, 2021.
[16] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
[17] Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, pages 4026–4035. PMLR, 2018.
[18] Alberto Maria Metelli, Matteo Papini, Francesco Faccio, and Marcello Restelli. Policy optimization via importance sampling. Advances in Neural Information Processing Systems, 31, 2018.
[19] Josiah P Hanna, Scott Niekum, and Peter Stone. Importance sampling in reinforcement learning with an estimated behavior policy. Machine Learning, 110(6):1267–1317, 2021.
[20] Brahma S. Pavse, Ishan Durugkar, Josiah P. Hanna, and Peter Stone. Reducing sampling error in batch temporal difference learning. In International Conference on Machine Learning, 2020.
[21] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
[22] Yusuke Narita, Shota Yasui, and Kohei Yata. Efficient counterfactual learning from bandit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4634–4641, 2019.
[23] Michael L Littman et al. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
[24] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
[25] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020.
[26] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5887–5896. PMLR, 2019.
[27] Laëtitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 64–69. IEEE, 2007.
[28] Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-agent deep reinforcement learning. arXiv preprint arXiv:1707.04402, 2017.
[29] Wenshuai Zhao, Yi Zhao, Zhiyuan Li, Juho Kannala, and Joni Pajarinen. Optimistic multi-agent policy gradient. arXiv preprint arXiv:2311.01953, 2023.
[30] Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. Maven: Multi-agent variational exploration. Advances in Neural Information Processing Systems, 32, 2019.
[31] Tonghan Wang, Jianhao Wang, Yi Wu, and Chongjie Zhang. Influence-based multi-agent exploration. arXiv preprint arXiv:1910.05512, 2019.
[32] Iou-Jen Liu, Unnat Jain, Raymond A Yeh, and Alexander Schwing. Cooperative exploration for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pages 6826–6836. PMLR, 2021.
[33] Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E Taylor, Wenyuan Tao, Zhen Wang, et al. Pmic: Improving multi-agent reinforcement learning with progressive mutual information collaboration. arXiv preprint arXiv:2203.08553, 2022.
[34] Tarun Gupta, Anuj Mahajan, Bei Peng, Wendelin Böhmer, and Shimon Whiteson. Uneven: Universal value exploration for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 3930–3941. PMLR, 2021.
[35] Lulu Zheng, Jiarui Chen, Jianhao Wang, Jiamin He, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao, and Chongjie Zhang. Episodic multi-agent reinforcement learning with curiosity-driven exploration. Advances in Neural Information Processing Systems, 34:3757–3769, 2021.
[36] Shaowei Zhang, Jiahan Cao, Lei Yuan, Yang Yu, and De-Chuan Zhan. Self-motivated multi-agent exploration. arXiv preprint arXiv:2301.02083, 2023.
[37] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016.
[38] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning, pages 5–32, 1992.
[39] Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.
[40] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
[41] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.
[42] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1407–1416. PMLR, 2018.
[43] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[44] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
[45] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
[46] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395. PMLR, 2014.
[47] Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang. Trust region policy optimisation in multi-agent reinforcement learning. arXiv preprint arXiv:2109.11251, 2021.
[48] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[49] Yanhao Ma and Jie Luo. Value-decomposition multi-agent proximal policy optimization. In 2022 China Automation Congress (CAC), pages 3460–3464. IEEE, 2022.
[50] Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, and Yaodong Yang. Heterogeneous-agent reinforcement learning. Journal of Machine Learning Research, 25(32):1–67, 2024.
[51] Filippos Christianos, Lukas Schäfer, and Stefano Albrecht. Shared experience actor-critic for multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 33:10707–10717, 2020.
[52] Stefano V Albrecht, Filippos Christianos, and Lukas Schäfer. Multi-agent reinforcement learning: Foundations and modern approaches. MIT Press, 2024.
[53] Wendelin Böhmer, Vitaly Kurin, and Shimon Whiteson. Deep coordination graphs. In International Conference on Machine Learning, pages 980–991. PMLR, 2020.
[54] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980
[55] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G.M. Araújo. Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022. URL http://jmlr.org/papers/v23/21-1342.html
[56] $D_{\mathrm{KL}}(\pi_D(\cdot \mid s) \,\|\, \pi(\cdot \mid s)) = O_p(1/m^2)$ under MA-PROPS, while $D_{\mathrm{KL}}(\pi_D(\cdot \mid s) \,\|\, \pi(\cdot \mid s)) = O_p(1/m)$ under on-policy sampling.

[57] $D_{\mathrm{KL}}(\pi_{D,i}(\cdot \mid s) \,\|\, \pi_i(\cdot \mid s)) = O_p(1/m^2)$ under MA-PROPS, while $D_{\mathrm{KL}}(\pi_{D,i}(\cdot \mid s) \,\|\, \pi_i(\cdot \mid s)) = O_p(1/m)$ under on-policy sampling, where $O_p$ denotes stochastic boundedness.

Proof. Since the behavior policy $\pi(a \mid s) = \prod_{i=1}^{n} \pi_i(a \mid s_i)$ is a single agent mapping joint states to joint actions, we can immediately apply the convergence result from Theorem 3 to obtain the convergence ...

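The rates stated above can be checked numerically in a toy setting. The following is a minimal sketch, not the paper's method: it assumes a single state, a hypothetical three-action policy, and reads m as the number of sampled actions. It estimates the mean KL sampling error under i.i.d. on-policy sampling at two values of m, and compares it with a greedy stand-in sampler (an assumption for illustration, not MA-PROPS or CoSER) that always takes the action whose count lags furthest behind its expected count.

```python
import math
import random


def kl_empirical_vs_policy(counts, policy):
    """KL(pi_D || pi) for empirical action frequencies pi_D, with 0*log(0) = 0."""
    m = sum(counts)
    kl = 0.0
    for n_a, p_a in zip(counts, policy):
        if n_a > 0:
            q_a = n_a / m
            kl += q_a * math.log(q_a / p_a)
    return kl


def on_policy_counts(policy, m, rng):
    """Draw m i.i.d. actions from the policy and count them."""
    counts = [0] * len(policy)
    for a in rng.choices(range(len(policy)), weights=policy, k=m):
        counts[a] += 1
    return counts


def greedy_counts(policy, m):
    """Toy adaptive sampler (illustrative assumption, not MA-PROPS):
    at each step take the action with the largest count deficit."""
    counts = [0] * len(policy)
    for t in range(1, m + 1):
        deficits = [p_a * t - n_a for p_a, n_a in zip(policy, counts)]
        counts[deficits.index(max(deficits))] += 1
    return counts


policy = [0.5, 0.3, 0.2]  # hypothetical single-state policy
rng = random.Random(0)    # fixed seed for repeatability
trials = 400

# Mean sampling error under on-policy sampling at two sample sizes.
mean_kl = {}
for m in (100, 1000):
    kls = [kl_empirical_vs_policy(on_policy_counts(policy, m, rng), policy)
           for _ in range(trials)]
    mean_kl[m] = sum(kls) / trials

# If the error shrinks like 1/m, a 10x larger m gives a ratio near 10.
ratio = mean_kl[100] / mean_kl[1000]
greedy_kl = kl_empirical_vs_policy(greedy_counts(policy, 1000), policy)

print(f"on-policy mean KL: m=100 -> {mean_kl[100]:.5f}, m=1000 -> {mean_kl[1000]:.5f}")
print(f"ratio: {ratio:.1f}; greedy KL at m=1000: {greedy_kl:.2e}")
```

The greedy sampler keeps every action count within a bounded distance of its expected count, which is why its KL error falls much faster than under i.i.d. sampling. It is deterministic and tabular, so it only illustrates the effect of correcting under-sampled actions and says nothing about how the paper's learned behavior policy is trained.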
[58] Sampling error curves for RL training in GridWorld, Climbing game, and Penalty game (Fig. 10).

[59] Training curves for all 21 distinct 2 × 2 no-conflict matrix games using MAPPO (Fig. 11) and IPPO (Fig. 12).

[60] Sampling error curves for RL training in all 21 distinct 2 × 2 no-conflict matrix games (Fig. 13 and Fig. 14).

[61] Training curves for BoulderPush and Level-based foraging tasks using IPPO (Fig. 15).

Footnote 6: Christianos et al. [4] show that on-policy policy gradient algorithms like MAPPO and MAA2C consistently converge suboptimally in the same tasks we consider.

Payoff matrices (rows: Agent 1's action; columns: Agent 2's action):

Game 1
            Agent 2: A    Agent 2: B
Agent 1: A  4, 4          3, 3
Agent 1: B  2, 2          1, 1

Game 2
            Agent 2: A    Agent 2: B
Agent 1: A  4, 4          3, 3
Agent 1: B  2, 1          2, 1

...

