Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning
Pith reviewed 2026-05-23 04:12 UTC · model grok-4.3
The pith
Optimistic action-value networks converge in probability to maximum returns and, when sampled with probability ε, raise the frequency of high-return joint actions to prevent suboptimal convergence in CTDE multi-agent RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimistic action-value networks serve as decoupled exploration indicators that converge in probability to the maximum achievable returns; sampling actions from these networks with probability ε increases the selection frequency of high-return joint actions and thereby prevents convergence to suboptimal solutions under standard CTDE training.
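A minimal sketch of this selection rule for a single agent with a discrete action set may make the claim concrete. The names `q_main`, `q_opt`, `optimistic_epsilon_greedy`, and the softmax `temperature` are illustrative assumptions: the source only states that actions are sampled from the optimistic networks with probability ε, not how the sampling distribution is formed.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def optimistic_epsilon_greedy(q_main, q_opt, epsilon, rng, temperature=1.0):
    """Pick an action index.

    With probability epsilon, sample from a softmax over the optimistic
    estimates (an assumed form of the exploration distribution); otherwise
    act greedily on the main value estimates.
    """
    if rng.random() < epsilon:
        probs = softmax(q_opt, temperature)
        return rng.choices(range(len(q_opt)), weights=probs, k=1)[0]
    return max(range(len(q_main)), key=lambda a: q_main[a])


rng = random.Random(0)
q_main = [1.0, 0.2, 0.1]  # hypothetical main estimates: action 0 looks best
q_opt = [1.0, 0.3, 5.0]   # hypothetical optimistic estimates: action 2 has the highest return seen
counts = [0, 0, 0]
for _ in range(10_000):
    counts[optimistic_epsilon_greedy(q_main, q_opt, 0.3, rng)] += 1
```

In this toy setting, exploration steps concentrate on action 2 instead of being spread uniformly over all three actions, which is the frequency effect the claim describes.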
What carries the argument
Optimistic action-value networks trained as decoupled exploration indicators that converge in probability to maximum returns.
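For reference, convergence in probability has the standard form below. The symbols $Q^{\mathrm{opt}}_t$ (the optimistic estimate after $t$ updates) and $Q^{\max}$ (the maximum achievable return) are placeholder notation assumed here, not taken from the paper.

```latex
\forall \delta > 0:\quad
\lim_{t \to \infty} \Pr\!\left( \left| Q^{\mathrm{opt}}_t(s, a) - Q^{\max}(s, a) \right| > \delta \right) = 0
```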
If this is right
- Value underestimation from under-sampling is mitigated without changing the monotonic value structure.
- The frequency of optimal joint actions rises during training, raising final returns and win rates.
- Convergence speed improves because the algorithm escapes suboptimal joint policies more reliably.
- The same exploration schedule works across multiple cooperative environments without environment-specific tuning.
Where Pith is reading between the lines
- Decoupling exploration indicators from the main value estimator may generalize to other CTDE variants that suffer from action-selection bias.
- The convergence-in-probability argument could be tested by tracking the gap between the optimistic network and the true maximum return over training episodes.
- If the independence assumption holds, the method could be combined with non-monotonic value factorizations without additional interference.
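The gap-tracking test suggested in the list above can be sketched on a toy problem. Everything here is an assumption made for illustration: the upward-only update rule is a stand-in for whatever optimistic loss the paper uses, and the return distribution is invented.

```python
import random


def optimistic_update(q, observed_return, lr):
    """Move the estimate toward the return only when the return is higher.

    Assumed stand-in for an optimistic target; the paper's actual loss
    may differ.
    """
    if observed_return > q:
        q += lr * (observed_return - q)
    return q


rng = random.Random(1)
returns, weights = [0.0, 4.0, 10.0], [0.5, 0.3, 0.2]  # hypothetical return distribution
max_return = max(returns)

q_opt = 0.0
gaps = []  # gap between the true maximum and the optimistic estimate, per episode
for episode in range(2000):
    r = rng.choices(returns, weights=weights, k=1)[0]
    q_opt = optimistic_update(q_opt, r, lr=0.5)
    gaps.append(max_return - q_opt)
```

Plotting `gaps` against the episode index gives the diagnostic curve: under this assumed rule the gap never grows and shrinks each time the maximum return is observed.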
Load-bearing premise
The optimistic networks can be trained independently without interfering with the main value estimation under the CTDE dynamics and exploration schedule used.
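One way to probe this premise is a snapshot check: run an update for one network and confirm the other network's parameters did not move. The sketch below uses hypothetical linear estimators and an assumed upward-only optimistic step; the names `main_step` and `optimistic_step` are not from the paper or its repository.

```python
import copy


def predict(weights, features):
    """Linear value estimate: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def main_step(params, features, target, lr=0.1):
    """Squared-error gradient step that touches only params['main']."""
    error = predict(params["main"], features) - target
    params["main"] = [w - lr * error * x for w, x in zip(params["main"], features)]


def optimistic_step(params, features, target, lr=0.1):
    """Upward-only step (assumed optimistic loss) that touches only params['opt']."""
    error = predict(params["opt"], features) - target
    if error < 0:  # update only when the target exceeds the current estimate
        params["opt"] = [w - lr * error * x for w, x in zip(params["opt"], features)]


params = {"main": [0.0, 0.0], "opt": [0.0, 0.0]}
features, target = [1.0, 2.0], 3.0

# Optimistic update: main parameters should be untouched.
before = copy.deepcopy(params)
optimistic_step(params, features, target)
main_unchanged = params["main"] == before["main"]
opt_changed = params["opt"] != before["opt"]

# Main update: optimistic parameters should be untouched.
before = copy.deepcopy(params)
main_step(params, features, target)
opt_unchanged = params["opt"] == before["opt"]
```

In an automatic-differentiation framework, the analogous check would be that the optimistic loss produces no gradient on any main-network parameter, and vice versa.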
What would settle it
An experiment in which the optimistic networks fail to converge in probability to the maximum returns or in which ε-sampling from them does not measurably increase the frequency of high-return joint actions would falsify the central claim.
Original abstract
The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, conventional methods based on CTDE can suffer from value underestimation and converge to suboptimal solutions. While such underestimation is typically attributed to the representational limitations of monotonic structures, we provide a novel perspective by demonstrating that the insufficient sampling of optimal joint actions during exploration is also a critical factor. To address this problem, we propose Optimistic $\epsilon$-Greedy Exploration. Our method introduces optimistic action-value networks that serve as decoupled exploration indicators, which we theoretically prove to converge in probability to the maximum achievable returns. By sampling actions from these distributions with a probability of $\epsilon$, we effectively increase the selection frequency of high-return joint actions. Experimental results in various environments reveal that our strategy effectively prevents the algorithm from falling into suboptimal solutions and significantly improves final returns, win rates, and convergence speeds compared to other enhanced algorithms. Our code has been open-sourced at https://github.com/qxqxtxdy/OptimisticExploration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in CTDE-based cooperative MARL, value underestimation and convergence to suboptimal policies arise in part from insufficient sampling of optimal joint actions during exploration. It introduces Optimistic ε-Greedy Exploration, which augments standard methods with decoupled optimistic action-value networks. These networks are proven to converge in probability to the maximum achievable returns; sampling from them with probability ε is shown to increase the frequency of high-return joint actions. Experiments across environments demonstrate improved returns, win rates, and convergence speed relative to baselines, with code released at the cited GitHub repository.
Significance. If the convergence result holds under the actual CTDE training regime and the empirical gains are robust, the work supplies a concrete mechanism for mitigating a previously under-emphasized source of suboptimality in multi-agent exploration. The open-sourced implementation is a positive contribution that enables direct replication and extension.
major comments (3)
- [Theoretical analysis] Theoretical analysis section: the stated convergence-in-probability result for the optimistic networks is derived under an assumption of independent updates, yet the manuscript provides no separate analysis showing that this independence is preserved when the optimistic networks are trained jointly with the main value networks under the standard CTDE gradient schedule and non-stationary environment dynamics.
- [Implementation and experimental sections] Implementation and experimental sections: the claim that the optimistic networks function as 'decoupled exploration indicators' is load-bearing for the justification of ε-sampling, but no ablation or gradient-flow analysis is reported that confirms the updates remain independent in the actual code; without this, the link between the theoretical guarantee and the observed performance gains is not fully established.
- [Experimental results] Experimental results: while gains are reported relative to 'other enhanced algorithms,' the paper does not include controls that isolate the contribution of the optimistic sampling from other implementation choices (e.g., network architecture, replay buffer, or hyper-parameter tuning), making it difficult to attribute improvements specifically to the proposed mechanism.
minor comments (2)
- Notation for the optimistic value function and the ε-sampling distribution should be introduced with explicit definitions before the convergence statement to improve readability.
- The abstract and introduction both refer to 'various environments' without a consolidated table listing all domains, agent counts, and evaluation metrics; adding such a table would aid comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and commit to revisions that strengthen the connection between theory and implementation.
- Referee: [Theoretical analysis] Theoretical analysis section: the stated convergence-in-probability result for the optimistic networks is derived under an assumption of independent updates, yet the manuscript provides no separate analysis showing that this independence is preserved when the optimistic networks are trained jointly with the main value networks under the standard CTDE gradient schedule and non-stationary environment dynamics.
  Authors: We agree that an explicit discussion of independence under joint training would strengthen the manuscript. The optimistic networks use fully separate parameters and an independent loss that receives no gradients from the main value networks; this separation is preserved under the CTDE schedule because the optimistic networks function as auxiliary heads updated on their own schedule. We will revise the theoretical analysis section and add a short appendix note confirming that the joint training dynamics do not violate the independence assumption.
  revision: yes
- Referee: [Implementation and experimental sections] Implementation and experimental sections: the claim that the optimistic networks function as 'decoupled exploration indicators' is load-bearing for the justification of ε-sampling, but no ablation or gradient-flow analysis is reported that confirms the updates remain independent in the actual code; without this, the link between the theoretical guarantee and the observed performance gains is not fully established.
  Authors: We acknowledge the value of explicit verification. In the revised version we will add both a gradient-flow diagram in the implementation section and an ablation that compares runs with and without enforced separation of the optimistic networks. These additions will directly confirm independence in the released code and tighten the link to the theoretical result.
  revision: yes
- Referee: [Experimental results] Experimental results: while gains are reported relative to 'other enhanced algorithms,' the paper does not include controls that isolate the contribution of the optimistic sampling from other implementation choices (e.g., network architecture, replay buffer, or hyper-parameter tuning), making it difficult to attribute improvements specifically to the proposed mechanism.
  Authors: This concern is valid. We will add controlled experiments that keep network architecture, replay buffer, and all hyperparameters fixed while toggling only the optimistic sampling component. The new results will be reported in an expanded experimental section to isolate the contribution of the proposed mechanism.
  revision: yes
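The controlled comparison promised in the last response could be set up roughly as follows. This is a sketch under stated assumptions: `RunConfig`, its field names, and every default value are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical experiment settings; names and values are illustrative only."""
    env_name: str = "example_env"
    hidden_dim: int = 64
    buffer_size: int = 5000
    learning_rate: float = 5e-4
    epsilon: float = 0.1
    seed: int = 0
    use_optimistic_sampling: bool = False


baseline = RunConfig()
treatment = replace(baseline, use_optimistic_sampling=True)

# Fields that differ between the two runs; a clean control differs in exactly one.
changed = sorted(
    name
    for name, value in asdict(baseline).items()
    if asdict(treatment)[name] != value
)
```

Deriving the treatment run from the baseline with `replace` on a frozen dataclass keeps architecture, buffer, and hyperparameters identical by construction, so any difference in outcome can be attributed to the toggled component (up to seed variance).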
Circularity Check
No circularity: theoretical convergence claim is independent of fitted quantities
The paper's central derivation is a claimed theoretical proof that optimistic action-value networks converge in probability to maximum returns, used to justify ε-sampling for exploration. This is presented as a first-principles result rather than a fit or renaming of empirical patterns. No equations reduce a prediction to a fitted input by construction, no self-citation chain bears the load of the uniqueness or convergence claim, and the optimistic networks are described as decoupled without the proof depending on the CTDE training dynamics in a self-referential way. The result is therefore self-contained against external benchmarks.