Focusing Influence Mechanism for Multi-Agent Reinforcement Learning

Seungyul Han; Sunwoo Lee; Yisak Park

arxiv: 2506.19417 · v2 · submitted 2025-06-24 · 💻 cs.LG · cs.MA

Focusing Influence Mechanism for Multi-Agent Reinforcement Learning

Yisak Park , Sunwoo Lee , Seungyul Han This is my paper

Pith reviewed 2026-05-19 07:14 UTC · model grok-4.3

classification 💻 cs.LG cs.MA

keywords multi-agent reinforcement learningsparse rewardscoordinated explorationinfluence mechanismeligibility tracesentropy criterionMARL benchmarks

0 comments

The pith

A focusing influence mechanism helps multi-agent systems explore sparse-reward environments more effectively by directing attention to under-explored state space regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Focusing Influence Mechanism to solve the problem of uncoordinated exploration in cooperative multi-agent reinforcement learning when rewards are extremely sparse. Agents typically spread their actions too thinly across the state space and fail to align on the same promising areas long enough to discover joint successes. FIM introduces an entropy-based criterion that highlights under-explored regions and pairs it with eligibility traces so that multiple agents can maintain consistent influence on those regions when it helps coordination. The result is more structured joint behavior and measurable gains over strong baselines on standard MARL benchmarks.

Core claim

The Focusing Influence Mechanism encourages agents to direct their influence toward under-explored parts of the state space using an entropy criterion, and employs eligibility traces to allow multiple agents to sustain and align their influence on the same regions, thereby enabling coordinated exploration even when rewards are extremely sparse.

What carries the argument

The Focusing Influence Mechanism (FIM), which combines an entropy-based criterion for identifying under-explored states with eligibility traces for persistent alignment among agents.

Load-bearing premise

An entropy-based criterion combined with eligibility traces will successfully encourage agents to focus and sustain influence on the same under-explored state space parts in a way that improves coordination without negative side effects or heavy environment-specific adjustments.

What would settle it

Running the same MARL benchmarks with the entropy criterion or eligibility traces removed and observing whether performance gains disappear while other components remain unchanged.

Figures

Figures reproduced from arXiv: 2506.19417 by Seungyul Han, Sunwoo Lee, Yisak Park.

**Figure 2.** Figure 2: (a–b) Empirical distribution pˆ of temporal changes ∆(d) for agent and box positions, averaged over x, y axes. (c) Entropy H(d) for each of the 8 state dimensions: (x, y) positions of agent A, agent B, box A, and box B in the Push-2-Box environment. We set the threshold δ to 0.1. To address the challenge presented in Section 4.1, we introduce a principled approach for identifying Center of Gravity (CoG) st… view at source ↗

**Figure 3.** Figure 3: Empirical distribution pˆ of temporal changes ∆(d) for CoG dimensions (box positions): (a–b): Without AFI. (c–d): With AFI, where Box A is the focused target. While the proposed SFI enables agents to identify critical CoG state dimensions, coordination often becomes unstable when multiple such dimensions are present. Agents may alternate attention across them without maintaining focus, leading to scattered… view at source ↗

**Figure 4.** Figure 4: Performance comparison across the proposed focusing components on Push-2-Box environment [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Performance comparison on SMAC environments [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Performance comparison on GRF environments [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Trajectory analysis in 3s_vs_5z: (a) Entropy H(d) with selected CoG state dimensions (b) Changes in enemy health and its trace e d t (c) Rendered frames for highlighting agents’ coordination. academy_3_vs_2, academy_4_vs_3, academy_counterattack) and their corresponding fullfield versions, which are more challenging due to the increased field size. Further environment details and visualizations are provid… view at source ↗

**Figure 8.** Figure 8: Ablation studies on SMAC 3s_vs_5z how eligibility traces evolve on enemy health dimensions during an episode, and [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Visualization of initial timestep in SMAC scenarios. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Visualization of initial agent positions in GRF scenarios. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: SMAC [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: GRF H(d) visualization E Comparison of Computational Complexity FIM computes intrinsic rewards by estimating each agent’s influence through counterfactual marginalization over its action set A for every dimension in CoGδ. This results in a space complexity of O(|N | · |A| · |CoGδ|) per timestep, while the time complexity remains O(1) due to GPU parallelization. FIM uses a lightweight three-layer multila… view at source ↗

**Figure 13.** Figure 13: FIM trajectory in GRF academy_3_vs_2_full_field In GRF, the CoG state dimensions identified by FIM primarily correspond to the position of the opponent goalkeeper, as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

**Figure 14.** Figure 14: Alternatives to SFI Scenario SFI EFI LFI 3s_vs_5z enemy health enemy health, enemy shield, enemy position enemy health 3s5z_vs_3s6z enemy health, enemy shield, ally health, ally shield, ally unit type enemy health, enemy shield, enemy position enemy health, enemy position, ally position academy_3_vs_2 goalkeeper position, goalkeeper direction opponent position, opponent direction, ball position, ball dire… view at source ↗

**Figure 15.** Figure 15: compares four variants: vanilla QMIX, QMIX with state focusing influence (SFI), QMIX with agent focusing influence (AFI), and the full FIM framework that integrates both components. In SMAC, focused fire emerges as a key cooperative strategy, where agents coordinate to attack a single enemy unit at a time. SFI supports this behavior by directing influence toward task-relevant CoG dimensions, such as enemy… view at source ↗

**Figure 16.** Figure 16: Effect of η 24 [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗

**Figure 17.** Figure 17: Effect of α λ Effect Analysis We examine the effect of the trace decay factor λ by varying it across λ ∈ {0.8, 0.85, 0.9, 0.95, 1}. The parameter λ determines how long the influence of past actions persists in the eligibility trace. As shown in [PITH_FULL_IMAGE:figures/full_fig_p025_17.png] view at source ↗

**Figure 18.** Figure 18: Effect of λ 25 [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗

read the original abstract

Cooperative multi-agent reinforcement learning (MARL) under sparse rewards remains fundamentally challenging because agents often fail to concentrate their influence, leading to insufficiently coordinated exploration. To address this, we propose the Focusing Influence Mechanism (FIM), a framework that encourages agents to focus their influence on under-explored parts of the state space through an entropy-based criterion, while leveraging eligibility traces to enable multiple agents to consistently align and sustain their influence on the same parts of the state space when beneficial, thereby promoting coordinated and persistent joint behavior. By emphasizing under-explored regions of the state space, FIM facilitates more efficient and structured exploration even under extremely sparse rewards. Across diverse MARL benchmarks, FIM consistently improves cooperative performance over strong baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FIM pairs entropy-based focus with eligibility traces to push coordinated exploration in sparse-reward MARL, but the reported gains rest on thin evidence and untested assumptions about reliable joint-state identification.

read the letter

The main point is that the paper introduces FIM to get multiple agents to concentrate influence on under-explored joint states via an entropy criterion and then hold that focus with eligibility traces. It claims this yields better cooperative performance than strong baselines on MARL benchmarks under sparse rewards. That combination is the actual new piece; prior work has used entropy for exploration and traces for credit, but tying them specifically to multi-agent influence alignment in this way is not standard. The framing of the coordination problem is clear and the high-level mechanism is straightforward to understand. The empirical section apparently shows consistent improvements, which at least suggests the idea is not obviously broken in the tested environments. The paper earns credit for targeting a real practical issue rather than adding another generic regularizer. The soft spots are more substantial. The abstract and description give no concrete details on how entropy is estimated under function approximation, how traces are shared or decayed across agents, or whether the method requires per-environment tuning of the focus threshold. In high-dimensional non-stationary joint spaces those estimates are typically noisy, and the stress-test concern about unreliable propagation of influence looks plausible unless the full experiments include ablations or variance analysis that directly address it. Without statistical tests, failure-case discussion, or comparison to simpler entropy bonuses, it is hard to tell whether the gains are robust or mainly from extra hyperparameters. This paper is for MARL researchers working on exploration and coordination in sparse settings. A reader already running multi-agent experiments could pick up the mechanism and test it quickly. It deserves a serious referee to check the implementation and results, even though the current write-up would likely come back with requests for more analysis and controls.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Focusing Influence Mechanism (FIM) for cooperative multi-agent reinforcement learning under sparse rewards. FIM uses an entropy-based criterion to direct agents toward under-explored regions of the joint state space and eligibility traces to sustain aligned influence across agents on those regions, with the goal of enabling more structured and persistent coordinated exploration. The central claim is that this yields consistent performance gains over strong baselines on diverse MARL benchmarks.

Significance. If the empirical claims are substantiated with rigorous evidence, FIM could provide a practical addition to the MARL toolkit for sparse-reward settings by repurposing entropy and eligibility traces to promote joint focus without introducing many new hyperparameters. The approach is grounded in standard RL primitives, which is a strength if it produces reproducible gains across environments.

major comments (2)

Abstract and §4 (Experiments): the assertion of 'consistent improvements over strong baselines' is presented without any reported statistical tests, variance across seeds, ablation results, or environment-specific implementation details. This leaves the central performance claim unsupported by the information provided.
§3 (FIM description): the entropy criterion is stated to flag under-explored joint regions and eligibility traces are said to enable sustained multi-agent alignment, yet no analysis is given of how these components behave under function approximation in high-dimensional, non-stationary joint state spaces. The skeptic concern that noisy entropy estimates and changing policies could break reliable coordination is not addressed with a concrete test or counter-example.

minor comments (1)

Notation for the entropy term and trace decay parameter should be introduced with explicit equations and distinguished from standard single-agent usage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the empirical support and analysis.

read point-by-point responses

Referee: Abstract and §4 (Experiments): the assertion of 'consistent improvements over strong baselines' is presented without any reported statistical tests, variance across seeds, ablation results, or environment-specific implementation details. This leaves the central performance claim unsupported by the information provided.

Authors: We agree that the performance claims require stronger statistical backing. In the revised version, we will report mean returns with standard deviations across multiple random seeds (at least five per environment), include pairwise statistical significance tests (e.g., Welch’s t-test with p-values), add ablation studies isolating the entropy criterion and eligibility traces, and expand the implementation details for each benchmark to specify network architectures, hyperparameters, and training protocols. These additions will directly support the central claim. revision: yes
Referee: §3 (FIM description): the entropy criterion is stated to flag under-explored joint regions and eligibility traces are said to enable sustained multi-agent alignment, yet no analysis is given of how these components behave under function approximation in high-dimensional, non-stationary joint state spaces. The skeptic concern that noisy entropy estimates and changing policies could break reliable coordination is not addressed with a concrete test or counter-example.

Authors: We acknowledge the value of explicit analysis under function approximation. While the empirical results were obtained with neural-network approximators in high-dimensional environments and show consistent gains, we will insert a new discussion subsection in §3 that examines the effects of noisy entropy estimates and policy non-stationarity on coordination. We will also add a brief sensitivity study and a simple illustrative counter-example in the appendix to address the concern about potential breakdown of alignment. revision: partial

Circularity Check

0 steps flagged

No circularity; mechanism uses standard RL primitives with external benchmark validation

full rationale

The paper introduces FIM by combining an entropy-based criterion for identifying under-explored state regions with eligibility traces to sustain multi-agent influence alignment. No equations, derivations, or claims in the abstract reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance improvements are asserted via consistent gains on diverse MARL benchmarks against strong baselines, providing independent empirical grounding outside any internal fitting loop. This satisfies the default expectation of a non-circular proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described. The approach builds on standard concepts of entropy and eligibility traces from reinforcement learning, but implementation details that might introduce custom parameters are not available.

pith-pipeline@v0.9.0 · 5646 in / 1179 out tokens · 45684 ms · 2026-05-19T07:14:04.644426+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we introduce a state focusing influence (SFI) mechanism that guides agents to increase their impact on such dimensions. ... H(d) = E_ρ [ -log p̂(Δd(st,st+1) | st) ] ... CoG_δ = {d | 0 < H(d) < δ}
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ed_t = λ · ed_{t-1} + η · Inf^d_t ... r_int,t = Σ wd · Inf^d_t · clip(ed_{t-1},1,cmax)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

The IBAL framework builds information-theoretic attacks that break agent interactions in MARL and trains policies to stay robust under observation and action perturbations.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Multi-agent reinforcement learning for redundant robot control in task-space

Adolfo Perrusquía, Wen Yu, and Xiaoou Li. Multi-agent reinforcement learning for redundant robot control in task-space. International Journal of Machine Learning and Cybernetics, 12: 231–241, 2021

work page 2021
[3]

Grandmaster level in starcraft ii using multi-agent reinforcement learning

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun- young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. nature, 575(7782):350–354, 2019

work page 2019
[4]

A concise introduction to decentralized POMDPs, volume 1

Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016

work page 2016
[5]

Optimal and approximate q-value functions for decentralized pomdps

Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate q-value functions for decentralized pomdps. Journal of Artificial Intelligence Research, 32:289–353, 2008

work page 2008
[6]

The surprising effectiveness of ppo in cooperative multi-agent games

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and YI WU. The surprising effectiveness of ppo in cooperative multi-agent games. In S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24611–24624. Curran Associates, Inc., 2022

work page 2022
[7]

Value- decomposition networks for cooperative multi-agent learning based on team reward

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value- decomposition networks for cooperative multi-agent learning based on team reward. In Pro- ceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems,...

work page 2085
[8]

Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 4295–4304. PMLR, 2018

work page 2018
[9]

Qplex: Duplex dueling multi-agent q-learning

Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. Qplex: Duplex dueling multi-agent q-learning. In International Conference on Learning Representations

work page
[10]

Social influence as intrinsic motivation for multi-agent deep reinforcement learning

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International conference on machine learning, pages 3040–3049. PMLR, 2019

work page 2019
[11]

Influence-based multi-agent exploration

Tonghan Wang, Jianhao Wang, Yi Wu, and Chongjie Zhang. Influence-based multi-agent exploration. In International Conference on Learning Representations, 2020

work page 2020
[12]

Cooperative exploration for multi-agent deep reinforcement learning

Iou-Jen Liu, Unnat Jain, Raymond A Yeh, and Alexander Schwing. Cooperative exploration for multi-agent deep reinforcement learning. In International conference on machine learning, pages 6826–6836. PMLR, 2021

work page 2021
[13]

Episodic multi-agent reinforcement learning with curiosity- driven exploration

Lulu Zheng, Jiarui Chen, Jianhao Wang, Jiamin He, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao, and Chongjie Zhang. Episodic multi-agent reinforcement learning with curiosity- driven exploration. Advances in Neural Information Processing Systems, 34:3757–3769, 2021

work page 2021
[14]

Celebrating diversity in shared multi-agent reinforcement learning

Chenghao Li, Tonghan Wang, Chengjie Wu, Qianchuan Zhao, Jun Yang, and Chongjie Zhang. Celebrating diversity in shared multi-agent reinforcement learning. Advances in Neural Infor- mation Processing Systems, 34:3991–4002, 2021

work page 2021
[15]

Clausewitz’s center of gravity: It’s not what we thought

Antulio J Echevarria. Clausewitz’s center of gravity: It’s not what we thought. Naval War College Review, 56(1):108–123, 2003

work page 2003
[16]

Coordinated exploration via intrinsic rewards for multi-agent rein- forcement learning

Shariq Iqbal and Fei Sha. Coordinated exploration via intrinsic rewards for multi-agent rein- forcement learning. arXiv preprint arXiv:1905.12127, 2019. 10

work page arXiv 1905
[17]

Two heads are better than one: A simple exploration framework for efficient multi-agent reinforcement learning

Jiahui Li, Kun Kuang, Baoxiang Wang, Xingchen Li, Fei Wu, Jun Xiao, and Long Chen. Two heads are better than one: A simple exploration framework for efficient multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36:20038–20053, 2023

work page 2023
[18]

Self-motivated multi- agent exploration

Shaowei Zhang, Jiahan Cao, Lei Yuan, Yang Yu, and De-Chuan Zhan. Self-motivated multi- agent exploration. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 476–484, 2023

work page 2023
[19]

Cmbe: Curiosity-driven model-based exploration for multi-agent reinforcement learning in sparse reward settings

Kai Yang, Zhirui Fang, Xiu Li, and Jian Tao. Cmbe: Curiosity-driven model-based exploration for multi-agent reinforcement learning in sparse reward settings. In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2024

work page 2024
[20]

Population-based diverse exploration for sparse-reward multi-agent tasks

Pei Xu, Junge Zhang, and Kaiqi Huang. Population-based diverse exploration for sparse-reward multi-agent tasks. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 283–291, 2024

work page 2024
[21]

Expode: Exploiting policy discrepancy for efficient exploration in multi-agent reinforcement learning

Yucong Zhang and Chao Yu. Expode: Exploiting policy discrepancy for efficient exploration in multi-agent reinforcement learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 58–66, 2023

work page 2023
[22]

Toward efficient multi-agent exploration with trajectory entropy maximization

Tianxu Li and Kun Zhu. Toward efficient multi-agent exploration with trajectory entropy maximization. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[23]

Learning joint behaviors with large variations

Tianxu Li and Kun Zhu. Learning joint behaviors with large variations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23249–23257, 2025

work page 2025
[24]

Maven: Multi-agent variational exploration

Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. Maven: Multi-agent variational exploration. Advances in neural information processing systems, 32, 2019

work page 2019
[25]

Fox: Formation-aware exploration in multi-agent reinforcement learning

Yonghyeon Jo, Sunwoo Lee, Junghyuk Yeom, and Seungyul Han. Fox: Formation-aware exploration in multi-agent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12985–12994, 2024

work page 2024
[26]

Hierarchical Deep Multiagent Reinforcement Learning with Temporal Abstraction

Hongyao Tang, Jianye Hao, Tangjie Lv, Yingfeng Chen, Zongzhang Zhang, Hangtian Jia, Chunxu Ren, Yan Zheng, Zhaopeng Meng, Changjie Fan, et al. Hierarchical deep multiagent reinforcement learning with temporal abstraction. arXiv preprint arXiv:1809.09332, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

Maser: Multi-agent rein- forcement learning with subgoals generated from experience replay buffer

Jeewon Jeon, Woojun Kim, Whiyoung Jung, and Youngchul Sung. Maser: Multi-agent rein- forcement learning with subgoals generated from experience replay buffer. In International conference on machine learning, pages 10041–10052. PMLR, 2022

work page 2022
[28]

Subspace-aware exploration for sparse-reward multi-agent tasks

Pei Xu, Junge Zhang, Qiyue Yin, Chao Yu, Yaodong Yang, and Kaiqi Huang. Subspace-aware exploration for sparse-reward multi-agent tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11717–11725, 2023

work page 2023
[29]

Saeir: sequentially accumulated entropy intrinsic reward for cooperative multi-agent reinforcement learning with sparse reward

Xin He, Hongwei Ge, Yaqing Hou, and Jincheng Yu. Saeir: sequentially accumulated entropy intrinsic reward for cooperative multi-agent reinforcement learning with sparse reward. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 4107–4115, 2024

work page 2024
[30]

Elign: Expectation alignment as a multi-agent intrinsic reward.Advances in Neural Information Processing Systems, 35:8304–8317, 2022

Zixian Ma, Rose Wang, Fei-Fei Li, Michael Bernstein, and Ranjay Krishna. Elign: Expectation alignment as a multi-agent intrinsic reward.Advances in Neural Information Processing Systems, 35:8304–8317, 2022

work page 2022
[31]

Pmic: Improving multi-agent reinforcement learning with progressive mutual information collaboration

Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E Taylor, Wenyuan Tao, and Zhen Wang. Pmic: Improving multi-agent reinforcement learning with progressive mutual information collaboration. In International Conference on Machine Learning, pages 12979–12997. PMLR, 2022

work page 2022
[32]

Cooperative multiagent learning and exploration with min–max intrinsic motivation

Yaqing Hou, Jie Kang, Haiyin Piao, Yifeng Zeng, Yew-Soon Ong, Yaochu Jin, and Qiang Zhang. Cooperative multiagent learning and exploration with min–max intrinsic motivation. IEEE Transactions on Cybernetics, 2025. 11

work page 2025
[33]

Learning individually inferred communication for multi-agent cooperation

Ziluo Ding, Tiejun Huang, and Zongqing Lu. Learning individually inferred communication for multi-agent cooperation. Advances in neural information processing systems, 33:22069–22079, 2020

work page 2020
[34]

Learning with opponent-learning awareness

Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems , pages 122–130, 2018

work page 2018
[35]

Stable opponent shaping in differentiable games

Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations, 2019

work page 2019
[36]

Learning latent representations to influence multi-agent interaction

Annie Xie, Dylan Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. Learning latent representations to influence multi-agent interaction. In Conference on robot learning, pages 575–588. PMLR, 2021

work page 2021
[37]

Influencing long-term behavior in multiagent reinforcement learning

Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob Foerster, Michael Everett, Chuangchuang Sun, Gerald Tesauro, and Jonathan P How. Influencing long-term behavior in multiagent reinforcement learning. Advances in Neural Information Processing Systems, 35:18808–18821, 2022

work page 2022
[38]

Imagine, initialize, and explore: An effective exploration method in multi-agent reinforcement learning

Zeyang Liu, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, and Xuguang Lan. Imagine, initialize, and explore: An effective exploration method in multi-agent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17487–17495, 2024

work page 2024
[39]

Settling decentralized multi-agent coordinated exploration by novelty sharing

Haobin Jiang, Ziluo Ding, and Zongqing Lu. Settling decentralized multi-agent coordinated exploration by novelty sharing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17444–17452, 2024

work page 2024
[40]

Learning to incentivize other learning agents

Jiachen Yang, Ang Li, Mehrdad Farajtabar, Peter Sunehag, Edward Hughes, and Hongyuan Zha. Learning to incentivize other learning agents. Advances in Neural Information Processing Systems, 33:15208–15219, 2020

work page 2020
[41]

Learning to penalize other learning agents

Kyrill Schmid, Lenz Belzner, and Claudia Linnhoff-Popien. Learning to penalize other learning agents. In Artificial Life Conference Proceedings 33, volume 2021, page 59. MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info . . . , 2021

work page 2021
[42]

Reciprocal reward influence encourages cooperation from self-interested agents

John L Zhou, Weizhe Hong, and Jonathan Kao. Reciprocal reward influence encourages cooperation from self-interested agents. Advances in Neural Information Processing Systems, 37:59491–59512, 2024

work page 2024
[43]

Lazy agents: A new perspective on solving sparse reward problem in multi-agent reinforcement learning

Boyin Liu, Zhiqiang Pu, Yi Pan, Jianqiang Yi, Yanyan Liang, and Du Zhang. Lazy agents: A new perspective on solving sparse reward problem in multi-agent reinforcement learning. In International Conference on Machine Learning, pages 21937–21950. PMLR, 2023

work page 2023
[44]

Individual contributions as intrinsic explo- ration scaffolds for multi-agent reinforcement learning

Xinran Li, Zifan Liu, Shibo Chen, and Jun Zhang. Individual contributions as intrinsic explo- ration scaffolds for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 28387–28402. PMLR, 2024

work page 2024
[45]

Counterfactual multi-agent policy gradients

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018
[46]

On the use and misuse of absorbing states in multi-agent reinforcement learning

Andrew Cohen, Ervin Teng, Vincent-Pierre Berges, Ruo-Ping Dong, Hunter Henry, Marwan Mattar, Alexander Zook, and Sujoy Ganguly. On the use and misuse of absorbing states in multi-agent reinforcement learning. arXiv preprint arXiv:2111.05992, 2021

work page arXiv 2021
[47]

Towards un- derstanding cooperative multi-agent q-learning with value factorization

Jianhao Wang, Zhizhou Ren, Beining Han, Jianing Ye, and Chongjie Zhang. Towards un- derstanding cooperative multi-agent q-learning with value factorization. Advances in Neural Information Processing Systems, 34:29142–29155, 2021. 12

work page 2021
[48]

Global rewards in multi-agent deep reinforcement learning for autonomous mobility on demand systems

Heiko Hoppe, Tobias Enders, Quentin Cappart, and Maximilian Schiffer. Global rewards in multi-agent deep reinforcement learning for autonomous mobility on demand systems. In 6th Annual Learning for Dynamics & Control Conference, pages 260–272. PMLR, 2024

work page 2024
[49]

Pac: Assisted value factorization with coun- terfactual predictions in multi-agent reinforcement learning

Hanhan Zhou, Tian Lan, and Vaneet Aggarwal. Pac: Assisted value factorization with coun- terfactual predictions in multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 35:15757–15769, 2022

work page 2022
[50]

Aligning credit for multi-agent co- operation via model-based counterfactual imagination

Jiajun Chai, Yuqian Fu, Dongbin Zhao, and Yuanheng Zhu. Aligning credit for multi-agent co- operation via model-based counterfactual imagination. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pages 281–289, 2024

work page 2024
[51]

Shapley q-value: A local reward approach to solve global reward games

Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: A local reward approach to solve global reward games. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7285–7292, 2020

work page 2020
[52]

Shapley counterfactual credits for multi-agent reinforcement learning

Jiahui Li, Kun Kuang, Baoxiang Wang, Furui Liu, Long Chen, Fei Wu, and Jun Xiao. Shapley counterfactual credits for multi-agent reinforcement learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 934–942, 2021

work page 2021
[53]

Shaq: Incorporating shapley value theory into multi-agent q-learning

Jianhong Wang, Yuan Zhang, Yunjie Gu, and Tae-Kyun Kim. Shaq: Incorporating shapley value theory into multi-agent q-learning. Advances in Neural Information Processing Systems, 35:5941–5954, 2022

work page 2022
[54]

Counterfactual conservative q learning for offline multi-agent reinforcement learning

Jianzhun Shao, Yun Qu, Chen Chen, Hongchang Zhang, and Xiangyang Ji. Counterfactual conservative q learning for offline multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36:77290–77312, 2023

work page 2023
[55]

Learning to coordinate from offline datasets with uncoordinated behavior policies

Jinming Ma and Feng Wu. Learning to coordinate from offline datasets with uncoordinated behavior policies. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 1258–1266, 2023

work page 2023
[56]

Understanding individual agent importance in multi-agent system via counterfactual reasoning

Jianming Chen, Yawen Wang, Junjie Wang, Xiaofei Xie, Jun Hu, Qing Wang, and Fanjiang Xu. Understanding individual agent importance in multi-agent system via counterfactual reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15785–15794, 2025

work page 2025
[57]

Statemask: Explaining deep reinforcement learning through state mask

Zelei Cheng, Xian Wu, Jiahao Yu, Wenhai Sun, Wenbo Guo, and Xinyu Xing. Statemask: Explaining deep reinforcement learning through state mask. Advances in Neural Information Processing Systems, 36:62457–62487, 2023

work page 2023
[58]

Learning nearly de- composable value functions via communication minimization

Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly de- composable value functions via communication minimization. In International Conference on Learning Representations, 2020

work page 2020
[59]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

work page 2018
[60]

Rode: Learning roles to decompose multi-agent tasks

Tonghan Wang, Tarun Gupta, Anuj Mahajan, Bei Peng, Shimon Whiteson, and Chongjie Zhang. Rode: Learning roles to decompose multi-agent tasks. In International Conference on Learning Representations, 2021

work page 2021
[61]

The StarCraft Multi-Agent Challenge,

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nan- tas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019

work page arXiv 1902
[62]

Google research football: A novel reinforcement learning environment

Karol Kurach, Anton Raichuk, Piotr Sta´nczyk, Michał Zaj ˛ ac, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google research football: A novel reinforcement learning environment. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 4501–4510, 2020

work page 2020
[63]

Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning

Jian Hu, Siyang Jiang, Seth Austin Harding, Haibin Wu, and Shih wei Liao. Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning. 2021. 13 A Broader Impact This work advances cooperative multi-agent systems by introducing a framework that fosters coordi- nated behavior through influence-based int...

work page 2021
[64]

Guidelines: • The answer NA means that the abstract and introduction do not include the claims made in the paper

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The abstract and introduction (Section 1) clearly and accurately reflect the paper’s main contributions and scope. Guidelines: • The answer NA means that the abstract and introduction do not include...

work page
[65]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 6 highlights that FIM assumes static influence targets and does not account for temporal adaptability. Guidelines: • The answer NA means that the paper has no limitation while the answer No means that the paper has limita...

work page
[66]

Guidelines: • The answer NA means that the paper does not include theoretical results

Theory assumptions and proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? 26 Answer: [NA] Justification: The paper does not include theoretical results. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and p...

work page
[67]

Justification: The paper ensures reproducibility by providing the full code along with detailed hyperparameter settings in Table 4

Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main ex- perimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] . Justification: The paper ensures reprod...

work page
[68]

Guidelines: • The answer NA means that paper does not include experiments requiring code

Open access to data and code 27 Question: Does the paper provide open access to the data and code, with sufficient instruc- tions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code is openly accessible with clear instructions for reproducing the main experimental results. Gui...

work page
[69]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyper- parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The paper provides all necessary training and testing details, including hyper- parameter settings, ablation ...

work page
[70]

Guidelines: • The answer NA means that the paper does not include experiments

Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: All performance plots report the mean across 5 random seeds with shaded areas denoting standard deviation, as stated in Section 5....

work page
[71]

Guidelines: • The answer NA means that the paper does not include experiments

Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the com- puter resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Section D specifies the GPU/CPU setup and Section E discusses the time of execution. Guidelines: • The ...

work page
[72]

Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics

Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: The research uses standard simulation benchmarks without involving human subjects or sensitive data and adheres to NeurIPS ethical guidelines throughout. Guide...

work page
[73]

Guidelines: 29 • The answer NA means that there is no societal impact of the work performed

Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: Societal impacts are discussed in Appendix A and no negative impacts are foreseen. Guidelines: 29 • The answer NA means that there is no societal impact of the work performed. • If the ...

work page
[74]

Guidelines: • The answer NA means that the paper poses no such risks

Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper poses no significant risk of misuse, as it does not release models or datasets with d...

work page
[75]

Guidelines: • The answer NA means that the paper does not use existing assets

Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: All used environments and baselines are properly cited in Section 5. Guidelines: • The answer NA...

work page
[76]

Guidelines: • The answer NA means that the paper does not release new assets

New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: The released code are documented and provided with usage instructions. Guidelines: • The answer NA means that the paper does not release new assets. • Researchers should communicate the details of...

work page
[77]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The research does not involve any human subjects or c...

work page
[78]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page
[79]

Answer: [NA] Justification: LLMs were not used in the development of the core methodology of this work

Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...

work page 2025

[1] [1]

Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Multi-agent reinforcement learning for redundant robot control in task-space

Adolfo Perrusquía, Wen Yu, and Xiaoou Li. Multi-agent reinforcement learning for redundant robot control in task-space. International Journal of Machine Learning and Cybernetics, 12: 231–241, 2021

work page 2021

[3] [3]

Grandmaster level in starcraft ii using multi-agent reinforcement learning

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Jun- young Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. nature, 575(7782):350–354, 2019

work page 2019

[4] [4]

A concise introduction to decentralized POMDPs, volume 1

Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016

work page 2016

[5] [5]

Optimal and approximate q-value functions for decentralized pomdps

Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate q-value functions for decentralized pomdps. Journal of Artificial Intelligence Research, 32:289–353, 2008

work page 2008

[6] [6]

The surprising effectiveness of ppo in cooperative multi-agent games

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and YI WU. The surprising effectiveness of ppo in cooperative multi-agent games. In S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24611–24624. Curran Associates, Inc., 2022

work page 2022

[7] [7]

Value- decomposition networks for cooperative multi-agent learning based on team reward

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value- decomposition networks for cooperative multi-agent learning based on team reward. In Pro- ceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems,...

work page 2085

[8] [8]

Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pages 4295–4304. PMLR, 2018

work page 2018

[9] [9]

Qplex: Duplex dueling multi-agent q-learning

Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. Qplex: Duplex dueling multi-agent q-learning. In International Conference on Learning Representations

work page

[10] [10]

Social influence as intrinsic motivation for multi-agent deep reinforcement learning

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International conference on machine learning, pages 3040–3049. PMLR, 2019

work page 2019

[11] [11]

Influence-based multi-agent exploration

Tonghan Wang, Jianhao Wang, Yi Wu, and Chongjie Zhang. Influence-based multi-agent exploration. In International Conference on Learning Representations, 2020

work page 2020

[12] [12]

Cooperative exploration for multi-agent deep reinforcement learning

Iou-Jen Liu, Unnat Jain, Raymond A Yeh, and Alexander Schwing. Cooperative exploration for multi-agent deep reinforcement learning. In International conference on machine learning, pages 6826–6836. PMLR, 2021

work page 2021

[13] [13]

Episodic multi-agent reinforcement learning with curiosity- driven exploration

Lulu Zheng, Jiarui Chen, Jianhao Wang, Jiamin He, Yujing Hu, Yingfeng Chen, Changjie Fan, Yang Gao, and Chongjie Zhang. Episodic multi-agent reinforcement learning with curiosity- driven exploration. Advances in Neural Information Processing Systems, 34:3757–3769, 2021

work page 2021

[14] [14]

Celebrating diversity in shared multi-agent reinforcement learning

Chenghao Li, Tonghan Wang, Chengjie Wu, Qianchuan Zhao, Jun Yang, and Chongjie Zhang. Celebrating diversity in shared multi-agent reinforcement learning. Advances in Neural Infor- mation Processing Systems, 34:3991–4002, 2021

work page 2021

[15] [15]

Clausewitz’s center of gravity: It’s not what we thought

Antulio J Echevarria. Clausewitz’s center of gravity: It’s not what we thought. Naval War College Review, 56(1):108–123, 2003

work page 2003

[16] [16]

Coordinated exploration via intrinsic rewards for multi-agent rein- forcement learning

Shariq Iqbal and Fei Sha. Coordinated exploration via intrinsic rewards for multi-agent rein- forcement learning. arXiv preprint arXiv:1905.12127, 2019. 10

work page arXiv 1905

[17] [17]

Two heads are better than one: A simple exploration framework for efficient multi-agent reinforcement learning

Jiahui Li, Kun Kuang, Baoxiang Wang, Xingchen Li, Fei Wu, Jun Xiao, and Long Chen. Two heads are better than one: A simple exploration framework for efficient multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36:20038–20053, 2023

work page 2023

[18] [18]

Self-motivated multi- agent exploration

Shaowei Zhang, Jiahan Cao, Lei Yuan, Yang Yu, and De-Chuan Zhan. Self-motivated multi- agent exploration. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 476–484, 2023

work page 2023

[19] [19]

Cmbe: Curiosity-driven model-based exploration for multi-agent reinforcement learning in sparse reward settings

Kai Yang, Zhirui Fang, Xiu Li, and Jian Tao. Cmbe: Curiosity-driven model-based exploration for multi-agent reinforcement learning in sparse reward settings. In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2024

work page 2024

[20] [20]

Population-based diverse exploration for sparse-reward multi-agent tasks

Pei Xu, Junge Zhang, and Kaiqi Huang. Population-based diverse exploration for sparse-reward multi-agent tasks. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 283–291, 2024

work page 2024

[21] [21]

Expode: Exploiting policy discrepancy for efficient exploration in multi-agent reinforcement learning

Yucong Zhang and Chao Yu. Expode: Exploiting policy discrepancy for efficient exploration in multi-agent reinforcement learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 58–66, 2023

work page 2023

[22] [22]

Toward efficient multi-agent exploration with trajectory entropy maximization

Tianxu Li and Kun Zhu. Toward efficient multi-agent exploration with trajectory entropy maximization. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025

[23] [23]

Learning joint behaviors with large variations

Tianxu Li and Kun Zhu. Learning joint behaviors with large variations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23249–23257, 2025

work page 2025

[24] [24]

Maven: Multi-agent variational exploration

Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. Maven: Multi-agent variational exploration. Advances in neural information processing systems, 32, 2019

work page 2019

[25] [25]

Fox: Formation-aware exploration in multi-agent reinforcement learning

Yonghyeon Jo, Sunwoo Lee, Junghyuk Yeom, and Seungyul Han. Fox: Formation-aware exploration in multi-agent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12985–12994, 2024

work page 2024

[26] [26]

Hierarchical Deep Multiagent Reinforcement Learning with Temporal Abstraction

Hongyao Tang, Jianye Hao, Tangjie Lv, Yingfeng Chen, Zongzhang Zhang, Hangtian Jia, Chunxu Ren, Yan Zheng, Zhaopeng Meng, Changjie Fan, et al. Hierarchical deep multiagent reinforcement learning with temporal abstraction. arXiv preprint arXiv:1809.09332, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[27] [27]

Maser: Multi-agent rein- forcement learning with subgoals generated from experience replay buffer

Jeewon Jeon, Woojun Kim, Whiyoung Jung, and Youngchul Sung. Maser: Multi-agent rein- forcement learning with subgoals generated from experience replay buffer. In International conference on machine learning, pages 10041–10052. PMLR, 2022

work page 2022

[28] [28]

Subspace-aware exploration for sparse-reward multi-agent tasks

Pei Xu, Junge Zhang, Qiyue Yin, Chao Yu, Yaodong Yang, and Kaiqi Huang. Subspace-aware exploration for sparse-reward multi-agent tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11717–11725, 2023

work page 2023

[29] [29]

Saeir: sequentially accumulated entropy intrinsic reward for cooperative multi-agent reinforcement learning with sparse reward

Xin He, Hongwei Ge, Yaqing Hou, and Jincheng Yu. Saeir: sequentially accumulated entropy intrinsic reward for cooperative multi-agent reinforcement learning with sparse reward. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 4107–4115, 2024

work page 2024

[30] [30]

Elign: Expectation alignment as a multi-agent intrinsic reward.Advances in Neural Information Processing Systems, 35:8304–8317, 2022

Zixian Ma, Rose Wang, Fei-Fei Li, Michael Bernstein, and Ranjay Krishna. Elign: Expectation alignment as a multi-agent intrinsic reward.Advances in Neural Information Processing Systems, 35:8304–8317, 2022

work page 2022

[31] [31]

Pmic: Improving multi-agent reinforcement learning with progressive mutual information collaboration

Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E Taylor, Wenyuan Tao, and Zhen Wang. Pmic: Improving multi-agent reinforcement learning with progressive mutual information collaboration. In International Conference on Machine Learning, pages 12979–12997. PMLR, 2022

work page 2022

[32] [32]

Cooperative multiagent learning and exploration with min–max intrinsic motivation

Yaqing Hou, Jie Kang, Haiyin Piao, Yifeng Zeng, Yew-Soon Ong, Yaochu Jin, and Qiang Zhang. Cooperative multiagent learning and exploration with min–max intrinsic motivation. IEEE Transactions on Cybernetics, 2025. 11

work page 2025

[33] [33]

Learning individually inferred communication for multi-agent cooperation

Ziluo Ding, Tiejun Huang, and Zongqing Lu. Learning individually inferred communication for multi-agent cooperation. Advances in neural information processing systems, 33:22069–22079, 2020

work page 2020

[34] [34]

Learning with opponent-learning awareness

Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems , pages 122–130, 2018

work page 2018

[35] [35]

Stable opponent shaping in differentiable games

Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations, 2019

work page 2019

[36] [36]

Learning latent representations to influence multi-agent interaction

Annie Xie, Dylan Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. Learning latent representations to influence multi-agent interaction. In Conference on robot learning, pages 575–588. PMLR, 2021

work page 2021

[37] [37]

Influencing long-term behavior in multiagent reinforcement learning

Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob Foerster, Michael Everett, Chuangchuang Sun, Gerald Tesauro, and Jonathan P How. Influencing long-term behavior in multiagent reinforcement learning. Advances in Neural Information Processing Systems, 35:18808–18821, 2022

work page 2022

[38] [38]

Imagine, initialize, and explore: An effective exploration method in multi-agent reinforcement learning

Zeyang Liu, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, and Xuguang Lan. Imagine, initialize, and explore: An effective exploration method in multi-agent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17487–17495, 2024

work page 2024

[39] [39]

Settling decentralized multi-agent coordinated exploration by novelty sharing

Haobin Jiang, Ziluo Ding, and Zongqing Lu. Settling decentralized multi-agent coordinated exploration by novelty sharing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17444–17452, 2024

work page 2024

[40] [40]

Learning to incentivize other learning agents

Jiachen Yang, Ang Li, Mehrdad Farajtabar, Peter Sunehag, Edward Hughes, and Hongyuan Zha. Learning to incentivize other learning agents. Advances in Neural Information Processing Systems, 33:15208–15219, 2020

work page 2020

[41] [41]

Learning to penalize other learning agents

Kyrill Schmid, Lenz Belzner, and Claudia Linnhoff-Popien. Learning to penalize other learning agents. In Artificial Life Conference Proceedings 33, volume 2021, page 59. MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info . . . , 2021

work page 2021

[42] [42]

Reciprocal reward influence encourages cooperation from self-interested agents

John L Zhou, Weizhe Hong, and Jonathan Kao. Reciprocal reward influence encourages cooperation from self-interested agents. Advances in Neural Information Processing Systems, 37:59491–59512, 2024

work page 2024

[43] [43]

Lazy agents: A new perspective on solving sparse reward problem in multi-agent reinforcement learning

Boyin Liu, Zhiqiang Pu, Yi Pan, Jianqiang Yi, Yanyan Liang, and Du Zhang. Lazy agents: A new perspective on solving sparse reward problem in multi-agent reinforcement learning. In International Conference on Machine Learning, pages 21937–21950. PMLR, 2023

work page 2023

[44] [44]

Individual contributions as intrinsic explo- ration scaffolds for multi-agent reinforcement learning

Xinran Li, Zifan Liu, Shibo Chen, and Jun Zhang. Individual contributions as intrinsic explo- ration scaffolds for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 28387–28402. PMLR, 2024

work page 2024

[45] [45]

Counterfactual multi-agent policy gradients

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018

[46] [46]

On the use and misuse of absorbing states in multi-agent reinforcement learning

Andrew Cohen, Ervin Teng, Vincent-Pierre Berges, Ruo-Ping Dong, Hunter Henry, Marwan Mattar, Alexander Zook, and Sujoy Ganguly. On the use and misuse of absorbing states in multi-agent reinforcement learning. arXiv preprint arXiv:2111.05992, 2021

work page arXiv 2021

[47] [47]

Towards un- derstanding cooperative multi-agent q-learning with value factorization

Jianhao Wang, Zhizhou Ren, Beining Han, Jianing Ye, and Chongjie Zhang. Towards un- derstanding cooperative multi-agent q-learning with value factorization. Advances in Neural Information Processing Systems, 34:29142–29155, 2021. 12

work page 2021

[48] [48]

Global rewards in multi-agent deep reinforcement learning for autonomous mobility on demand systems

Heiko Hoppe, Tobias Enders, Quentin Cappart, and Maximilian Schiffer. Global rewards in multi-agent deep reinforcement learning for autonomous mobility on demand systems. In 6th Annual Learning for Dynamics & Control Conference, pages 260–272. PMLR, 2024

work page 2024

[49] [49]

Pac: Assisted value factorization with coun- terfactual predictions in multi-agent reinforcement learning

Hanhan Zhou, Tian Lan, and Vaneet Aggarwal. Pac: Assisted value factorization with coun- terfactual predictions in multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 35:15757–15769, 2022

work page 2022

[50] [50]

Aligning credit for multi-agent co- operation via model-based counterfactual imagination

Jiajun Chai, Yuqian Fu, Dongbin Zhao, and Yuanheng Zhu. Aligning credit for multi-agent co- operation via model-based counterfactual imagination. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pages 281–289, 2024

work page 2024

[51] [51]

Shapley q-value: A local reward approach to solve global reward games

Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: A local reward approach to solve global reward games. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7285–7292, 2020

work page 2020

[52] [52]

Shapley counterfactual credits for multi-agent reinforcement learning

Jiahui Li, Kun Kuang, Baoxiang Wang, Furui Liu, Long Chen, Fei Wu, and Jun Xiao. Shapley counterfactual credits for multi-agent reinforcement learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 934–942, 2021

work page 2021

[53] [53]

Shaq: Incorporating shapley value theory into multi-agent q-learning

Jianhong Wang, Yuan Zhang, Yunjie Gu, and Tae-Kyun Kim. Shaq: Incorporating shapley value theory into multi-agent q-learning. Advances in Neural Information Processing Systems, 35:5941–5954, 2022

work page 2022

[54] [54]

Counterfactual conservative q learning for offline multi-agent reinforcement learning

Jianzhun Shao, Yun Qu, Chen Chen, Hongchang Zhang, and Xiangyang Ji. Counterfactual conservative q learning for offline multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36:77290–77312, 2023

work page 2023

[55] [55]

Learning to coordinate from offline datasets with uncoordinated behavior policies

Jinming Ma and Feng Wu. Learning to coordinate from offline datasets with uncoordinated behavior policies. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 1258–1266, 2023

work page 2023

[56] [56]

Understanding individual agent importance in multi-agent system via counterfactual reasoning

Jianming Chen, Yawen Wang, Junjie Wang, Xiaofei Xie, Jun Hu, Qing Wang, and Fanjiang Xu. Understanding individual agent importance in multi-agent system via counterfactual reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15785–15794, 2025

work page 2025

[57] [57]

Statemask: Explaining deep reinforcement learning through state mask

Zelei Cheng, Xian Wu, Jiahao Yu, Wenhai Sun, Wenbo Guo, and Xinyu Xing. Statemask: Explaining deep reinforcement learning through state mask. Advances in Neural Information Processing Systems, 36:62457–62487, 2023

work page 2023

[58] [58]

Learning nearly de- composable value functions via communication minimization

Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly de- composable value functions via communication minimization. In International Conference on Learning Representations, 2020

work page 2020

[59] [59]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

work page 2018

[60] [60]

Rode: Learning roles to decompose multi-agent tasks

Tonghan Wang, Tarun Gupta, Anuj Mahajan, Bei Peng, Shimon Whiteson, and Chongjie Zhang. Rode: Learning roles to decompose multi-agent tasks. In International Conference on Learning Representations, 2021

work page 2021

[61] [61]

The StarCraft Multi-Agent Challenge,

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nan- tas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019

work page arXiv 1902

[62] [62]

Google research football: A novel reinforcement learning environment

Karol Kurach, Anton Raichuk, Piotr Sta´nczyk, Michał Zaj ˛ ac, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google research football: A novel reinforcement learning environment. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 4501–4510, 2020

work page 2020

[63] [63]

Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning

Jian Hu, Siyang Jiang, Seth Austin Harding, Haibin Wu, and Shih wei Liao. Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning. 2021. 13 A Broader Impact This work advances cooperative multi-agent systems by introducing a framework that fosters coordi- nated behavior through influence-based int...

work page 2021

[64] [64]

Guidelines: • The answer NA means that the abstract and introduction do not include the claims made in the paper

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The abstract and introduction (Section 1) clearly and accurately reflect the paper’s main contributions and scope. Guidelines: • The answer NA means that the abstract and introduction do not include...

work page

[65] [65]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 6 highlights that FIM assumes static influence targets and does not account for temporal adaptability. Guidelines: • The answer NA means that the paper has no limitation while the answer No means that the paper has limita...

work page

[66] [66]

Guidelines: • The answer NA means that the paper does not include theoretical results

Theory assumptions and proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? 26 Answer: [NA] Justification: The paper does not include theoretical results. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and p...

work page

[67] [67]

Justification: The paper ensures reproducibility by providing the full code along with detailed hyperparameter settings in Table 4

Experimental result reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main ex- perimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] . Justification: The paper ensures reprod...

work page

[68] [68]

Guidelines: • The answer NA means that paper does not include experiments requiring code

Open access to data and code 27 Question: Does the paper provide open access to the data and code, with sufficient instruc- tions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code is openly accessible with clear instructions for reproducing the main experimental results. Gui...

work page

[69] [69]

Guidelines: • The answer NA means that the paper does not include experiments

Experimental setting/details Question: Does the paper specify all the training and test details (e.g., data splits, hyper- parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The paper provides all necessary training and testing details, including hyper- parameter settings, ablation ...

work page

[70] [70]

Guidelines: • The answer NA means that the paper does not include experiments

Experiment statistical significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: All performance plots report the mean across 5 random seeds with shaded areas denoting standard deviation, as stated in Section 5....

work page

[71] [71]

Guidelines: • The answer NA means that the paper does not include experiments

Experiments compute resources Question: For each experiment, does the paper provide sufficient information on the com- puter resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Section D specifies the GPU/CPU setup and Section E discusses the time of execution. Guidelines: • The ...

work page

[72] [72]

Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics

Code of ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: The research uses standard simulation benchmarks without involving human subjects or sensitive data and adheres to NeurIPS ethical guidelines throughout. Guide...

work page

[73] [73]

Guidelines: 29 • The answer NA means that there is no societal impact of the work performed

Broader impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: Societal impacts are discussed in Appendix A and no negative impacts are foreseen. Guidelines: 29 • The answer NA means that there is no societal impact of the work performed. • If the ...

work page

[74] [74]

Guidelines: • The answer NA means that the paper poses no such risks

Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper poses no significant risk of misuse, as it does not release models or datasets with d...

work page

[75] [75]

Guidelines: • The answer NA means that the paper does not use existing assets

Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: All used environments and baselines are properly cited in Section 5. Guidelines: • The answer NA...

work page

[76] [76]

Guidelines: • The answer NA means that the paper does not release new assets

New assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: The released code are documented and provided with usage instructions. Guidelines: • The answer NA means that the paper does not release new assets. • Researchers should communicate the details of...

work page

[77] [77]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Crowdsourcing and research with human subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The research does not involve any human subjects or c...

work page

[78] [78]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[79] [79]

Answer: [NA] Justification: LLMs were not used in the development of the core methodology of this work

Declaration of LLM usage Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, decla...

work page 2025