arxiv: 2502.03506 · v2 · submitted 2025-02-05 · 💻 cs.MA · cs.LG

Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning

Pith reviewed 2026-05-23 04:12 UTC · model grok-4.3

classification 💻 cs.MA cs.LG
keywords multi-agent reinforcement learning · CTDE · exploration strategy · optimistic value networks · epsilon-greedy · cooperative MARL · value underestimation · joint action selection

The pith

Optimistic action-value networks converge in probability to maximum returns and, when sampled with probability ε, raise the frequency of high-return joint actions to prevent suboptimal convergence in CTDE multi-agent RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that insufficient sampling of optimal joint actions during exploration, not only representational limits, drives value underestimation and suboptimal policies in cooperative multi-agent reinforcement learning under the CTDE paradigm. It introduces optimistic action-value networks as separate exploration indicators. These networks are shown to converge in probability to the highest achievable returns. Sampling from their distributions with probability ε then boosts selection of high-return joint actions. Experiments across environments confirm higher final returns, win rates, and faster convergence than prior enhanced methods.
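
For readers who want the convergence statement spelled out, the following is the standard definition of convergence in probability applied to the target quantity that appears in the paper's Figure 1 text, $\max_{a_{-i}} r_i(a_i \mid a_{-i})$. The symbol $f_i^{(t)}$ for agent $i$'s optimistic estimate after $t$ updates is notation assumed here for illustration and may differ from the paper's own.

$$
\lim_{t \to \infty} \Pr\left( \left| f_i^{(t)}(a_i) - \max_{a_{-i}} r_i(a_i \mid a_{-i}) \right| > \delta \right) = 0 \qquad \text{for every } \delta > 0 .
$$

Read this way, the claim is that the chance of the optimistic estimate sitting more than any fixed margin $\delta$ away from the best achievable return for action $a_i$ shrinks to zero as training proceeds.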

Core claim

Optimistic action-value networks serve as decoupled exploration indicators that converge in probability to the maximum achievable returns; sampling actions from these networks with probability ε increases the selection frequency of high-return joint actions and thereby prevents convergence to suboptimal solutions under standard CTDE training.
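
The selection rule in the core claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the softmax used to turn optimistic values into a sampling distribution, the temperature parameter, and the names `softmax` and `select_action` are all assumptions made here.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Turn a list of values into a probability distribution (assumed form)."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(q_main, q_optimistic, epsilon, rng=random):
    """With probability epsilon, sample an action from the distribution induced
    by the optimistic values; otherwise act greedily on the main values."""
    if rng.random() < epsilon:
        probs = softmax(q_optimistic)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(q_main)), key=lambda a: q_main[a])
```

Compared with plain ε-greedy, the only change in this sketch is the exploratory branch: the uniform draw is replaced by a draw that favors actions the optimistic estimate rates highly.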

What carries the argument

Optimistic action-value networks trained as decoupled exploration indicators that converge in probability to maximum returns.
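
A toy tabular example may help show why an optimistic estimate can point at the right action when an ordinary average does not. The payoff matrix, the uniformly random teammate, and the "update only on positive error" rule below are illustrative assumptions, not the paper's networks or loss.

```python
import random

# Illustrative payoff matrix (an assumption, not taken from the paper):
# rows are agent i's actions, columns are the teammate's actions.
PAYOFF = [
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
]


def train_estimates(steps=20000, alpha=0.1, seed=0):
    """Return (mean, optimistic) per-action estimates for agent i."""
    rng = random.Random(seed)
    n = len(PAYOFF)
    counts = [0] * n
    mean = [0.0] * n
    # Start the optimistic table at the lowest reward so it can only climb.
    optimistic = [min(min(row) for row in PAYOFF)] * n
    for _ in range(steps):
        a_i = rng.randrange(n)       # agent i explores uniformly
        a_other = rng.randrange(n)   # teammate also acts uniformly
        r = PAYOFF[a_i][a_other]
        counts[a_i] += 1
        mean[a_i] += (r - mean[a_i]) / counts[a_i]  # ordinary sample mean
        if r > optimistic[a_i]:                     # ignore negative errors
            optimistic[a_i] += alpha * (r - optimistic[a_i])
    return mean, optimistic
```

In this toy setting the sample mean for action 0 is dragged down by the two penalized teammate responses, so it ranks below the other actions, while the optimistic estimate climbs toward the row maximum of 8 and identifies action 0 as the best choice.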

If this is right

  • Value underestimation from under-sampling is mitigated without changing the monotonic value structure.
  • The frequency of optimal joint actions rises during training, raising final returns and win rates.
  • Convergence speed improves because the algorithm escapes suboptimal joint policies more reliably.
  • The same exploration schedule works across multiple cooperative environments without environment-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Decoupling exploration indicators from the main value estimator may generalize to other CTDE variants that suffer from action-selection bias.
  • The convergence-in-probability argument could be tested by tracking the gap between the optimistic network and the true maximum return over training episodes.
  • If the independence assumption holds, the method could be combined with non-monotonic value factorizations without additional interference.
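
The gap-tracking test suggested above could be run along these lines. The sketch assumes several independent training runs, each logging the optimistic estimate for one action at fixed checkpoints, and a known true maximum return; the function name `exceedance_curve` and the data layout are hypothetical.

```python
def exceedance_curve(runs, true_max, delta):
    """For each checkpoint, return the fraction of runs whose optimistic
    estimate is still more than `delta` away from `true_max`.

    `runs` is a list of equal-length lists, one list of logged estimates per run.
    """
    n_checkpoints = len(runs[0])
    curve = []
    for t in range(n_checkpoints):
        misses = sum(1 for run in runs if abs(run[t] - true_max) > delta)
        curve.append(misses / len(runs))
    return curve


# Example with made-up logs from three runs and a true maximum of 8.
logs = [[0.0, 4.0, 7.9], [0.0, 7.95, 8.0], [2.0, 3.0, 5.0]]
curve = exceedance_curve(logs, true_max=8.0, delta=0.2)
```

A curve that decays toward zero for every choice of `delta` is what convergence in probability would predict; a curve that plateaus above zero would be evidence against it.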

Load-bearing premise

The optimistic networks can be trained independently without interfering with the main value estimation under the CTDE dynamics and exploration schedule used.

What would settle it

An experiment in which the optimistic networks fail to converge in probability to the maximum returns or in which ε-sampling from them does not measurably increase the frequency of high-return joint actions would falsify the central claim.
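
The second half of this test, whether ε-sampling from the optimistic values raises the frequency of high-return joint actions, can be illustrated with a small simulation. Everything here is an assumption for illustration: two agents share fixed value tables, the main values are underestimated so that a suboptimal action looks best, the optimal joint action is taken to be (0, 0), and the sampling distribution is a softmax.

```python
import math
import random


def sample_softmax(values, rng):
    """Draw an index with probability proportional to exp(value)."""
    peak = max(values)
    weights = [math.exp(v - peak) for v in values]
    return rng.choices(range(len(values)), weights=weights, k=1)[0]


def act(q_main, q_opt, epsilon, rng, optimistic):
    """Exploratory branch is optimistic sampling or a uniform draw."""
    if rng.random() < epsilon:
        if optimistic:
            return sample_softmax(q_opt, rng)
        return rng.randrange(len(q_main))
    return max(range(len(q_main)), key=lambda a: q_main[a])


def optimal_joint_frequency(optimistic, episodes=10000, epsilon=0.3, seed=0):
    """Fraction of episodes in which both agents pick action 0."""
    rng = random.Random(seed)
    q_main = [-5.0, -4.0, -4.5]  # hypothetical underestimated main values
    q_opt = [8.0, 0.0, 0.0]      # hypothetical optimistic values
    hits = 0
    for _ in range(episodes):
        joint = (act(q_main, q_opt, epsilon, rng, optimistic),
                 act(q_main, q_opt, epsilon, rng, optimistic))
        if joint == (0, 0):
            hits += 1
    return hits / episodes
```

Under uniform exploration each agent reaches action 0 with probability ε/3, so the joint action appears in roughly (ε/3)² of episodes; with optimistic sampling the rate is close to ε². A real test would replace the fixed tables with learned networks and log the same frequency during training.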

Figures

Figures reproduced from arXiv: 2502.03506 by Ruijie Zhang, Ruoning Zhang, Siying Wang, Stefano V. Albrecht, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang.

Figure 1
Figure 1: Overall framework of Optimistic ε-Greedy Exploration.
Excerpt of adjacent text: … maximum value $\max_{a_{-i}} r_i(a_i \mid a_{-i})$ of the corresponding action $a_i \in A$. This enables the identification of the optimal action $a_i^*$ by comparing the function values of different actions. Since the team only receives the total reward feedback in practice, we exploit the joint additivity property of convergence in probability by updating the sum of all opt…
Figure 2
Figure 2: Experimental Results of Matrix Game and Predator Prey
Figure 3
Figure 3: Experimental Results of StarCraft Multi-Agent Challenge
Original abstract

The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, conventional methods based on CTDE can suffer from value underestimation and converge to suboptimal solutions. While such underestimation is typically attributed to the representational limitations of monotonic structures, we provide a novel perspective by demonstrating that the insufficient sampling of optimal joint actions during exploration is also a critical factor. To address this problem, we propose Optimistic $\epsilon$-Greedy Exploration. Our method introduces optimistic action-value networks that serve as decoupled exploration indicators, which we theoretically prove to converge in probability to the maximum achievable returns. By sampling actions from these distributions with a probability of $\epsilon$, we effectively increase the selection frequency of high-return joint actions. Experimental results in various environments reveal that our strategy effectively prevents the algorithm from falling into suboptimal solutions and significantly improves final returns, win rates, and convergence speeds compared to other enhanced algorithms. Our code has been open-sourced at https://github.com/qxqxtxdy/OptimisticExploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that in CTDE-based cooperative MARL, value underestimation and convergence to suboptimal policies arise in part from insufficient sampling of optimal joint actions during exploration. It introduces Optimistic ε-Greedy Exploration, which augments standard methods with decoupled optimistic action-value networks. These networks are proven to converge in probability to the maximum achievable returns; sampling from them with probability ε is shown to increase the frequency of high-return joint actions. Experiments across environments demonstrate improved returns, win rates, and convergence speed relative to baselines, with code released at the cited GitHub repository.

Significance. If the convergence result holds under the actual CTDE training regime and the empirical gains are robust, the work supplies a concrete mechanism for mitigating a previously under-emphasized source of suboptimality in multi-agent exploration. The open-sourced implementation is a positive contribution that enables direct replication and extension.

major comments (3)
  1. [Theoretical analysis] The stated convergence-in-probability result for the optimistic networks is derived under an assumption of independent updates, yet the manuscript provides no separate analysis showing that this independence is preserved when the optimistic networks are trained jointly with the main value networks under the standard CTDE gradient schedule and non-stationary environment dynamics.
  2. [Implementation and experimental sections] The claim that the optimistic networks function as 'decoupled exploration indicators' is load-bearing for the justification of ε-sampling, but no ablation or gradient-flow analysis is reported that confirms the updates remain independent in the actual code; without this, the link between the theoretical guarantee and the observed performance gains is not fully established.
  3. [Experimental results] While gains are reported relative to 'other enhanced algorithms,' the paper does not include controls that isolate the contribution of the optimistic sampling from other implementation choices (e.g., network architecture, replay buffer, or hyper-parameter tuning), making it difficult to attribute improvements specifically to the proposed mechanism.
minor comments (2)
  1. Notation for the optimistic value function and the ε-sampling distribution should be introduced with explicit definitions before the convergence statement to improve readability.
  2. The abstract and introduction both refer to 'various environments' without a consolidated table listing all domains, agent counts, and evaluation metrics; adding such a table would aid comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and commit to revisions that strengthen the connection between theory and implementation.

Point-by-point responses
  1. Referee: [Theoretical analysis] The stated convergence-in-probability result for the optimistic networks is derived under an assumption of independent updates, yet the manuscript provides no separate analysis showing that this independence is preserved when the optimistic networks are trained jointly with the main value networks under the standard CTDE gradient schedule and non-stationary environment dynamics.

    Authors: We agree that an explicit discussion of independence under joint training would strengthen the manuscript. The optimistic networks use fully separate parameters and an independent loss that receives no gradients from the main value networks; this separation is preserved under the CTDE schedule because the optimistic networks function as auxiliary heads updated on their own schedule. We will revise the theoretical analysis section and add a short appendix note confirming that the joint training dynamics do not violate the independence assumption.
    Revision: yes

  2. Referee: [Implementation and experimental sections] The claim that the optimistic networks function as 'decoupled exploration indicators' is load-bearing for the justification of ε-sampling, but no ablation or gradient-flow analysis is reported that confirms the updates remain independent in the actual code; without this, the link between the theoretical guarantee and the observed performance gains is not fully established.

    Authors: We acknowledge the value of explicit verification. In the revised version we will add both a gradient-flow diagram in the implementation section and an ablation that compares runs with and without enforced separation of the optimistic networks. These additions will directly confirm independence in the released code and tighten the link to the theoretical result.
    Revision: yes

  3. Referee: [Experimental results] While gains are reported relative to 'other enhanced algorithms,' the paper does not include controls that isolate the contribution of the optimistic sampling from other implementation choices (e.g., network architecture, replay buffer, or hyper-parameter tuning), making it difficult to attribute improvements specifically to the proposed mechanism.

    Authors: This concern is valid. We will add controlled experiments that keep network architecture, replay buffer, and all hyperparameters fixed while toggling only the optimistic sampling component. The new results will be reported in an expanded experimental section to isolate the contribution of the proposed mechanism.
    Revision: yes
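
A controlled comparison of the kind promised in the third response could be set up so that the two arms provably differ in one switch only. The sketch below is a hypothetical configuration layout; every field name and default value is an assumption and none is taken from the released code.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    # All fields and defaults are hypothetical placeholders.
    mixer: str = "monotonic"
    hidden_dim: int = 64
    replay_capacity: int = 5000
    learning_rate: float = 5e-4
    epsilon: float = 0.1
    seed: int = 0
    optimistic_sampling: bool = False


def ablation_pair(base):
    """Return (control, treatment) configs that differ only in the toggle."""
    control = replace(base, optimistic_sampling=False)
    treatment = replace(base, optimistic_sampling=True)
    return control, treatment


def differing_fields(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}


control, treatment = ablation_pair(RunConfig())
changed = differing_fields(control, treatment)
```

Checking `changed` before launching a sweep guards against the confound the referee describes, where a second setting drifts between arms without being noticed.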

Circularity Check

0 steps flagged

No circularity: theoretical convergence claim is independent of fitted quantities

full rationale

The paper's central derivation is a claimed theoretical proof that optimistic action-value networks converge in probability to maximum returns, used to justify ε-sampling for exploration. This is presented as a first-principles result rather than a fit or renaming of empirical patterns. No equations reduce a prediction to a fitted input by construction, no self-citation chain bears the load of the uniqueness or convergence claim, and the optimistic networks are described as decoupled without the proof depending on the CTDE training dynamics in a self-referential way. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond standard RL assumptions; the optimistic networks are presented as a new component whose convergence is claimed but not derived here.
