Learning Strategic Value and Cooperation in Multi-Player Stochastic Games through Side Payments

Alan Kuhnle; Darleen Perez-Lavin; Jeffrey Richley; Jessica Singh Syal; Solmaz Kia; Yixin Chen

arxiv: 2303.05307 · v2 · submitted 2023-03-09 · 💻 cs.GT · cs.AI

Learning Strategic Value and Cooperation in Multi-Player Stochastic Games through Side Payments

Yixin Chen , Jeffrey Richley , Darleen Perez-Lavin , Jessica Singh Syal , Solmaz Kia , Alan Kuhnle This is my paper

Pith reviewed 2026-05-24 09:28 UTC · model grok-4.3

classification 💻 cs.GT cs.AI

keywords stochastic gamesHarsanyi-Shapley valueside paymentsmulti-player cooperationtransferable utilityvalue conceptsBellman operatoraxiomatic characterization

0 comments

The pith

In multi-player stochastic games, HS-S is the unique mapping satisfying the extended Harsanyi-Shapley axioms, yet it diverges from Coco-S when more than two players participate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the Harsanyi-Shapley value from normal-form games to stochastic games with transferable utility by introducing HS-S, which aggregates dynamic coalition-versus-complement threat powers, and Coco-S, defined as fixed points of a statewise HS Bellman operator. It proves that HS-S is the only mapping satisfying the stochastic version of the HS axioms. The two notions agree in every two-player stochastic game but produce different values in some three-player games, as shown by an explicit counterexample. The work further establishes existence and uniqueness of Coco-S fixed points for all two-player games and three-player two-state games, introduces a Markov Consistency axiom that characterizes Coco-S, and supplies sampling estimators with finite-sample guarantees for computing the values and induced side payments.

Core claim

HS-S is the unique mapping satisfying the extended HS-style axioms in the stochastic setting. HS-S and Coco-S coincide in all two-player stochastic games but disagree for n>2 via an explicit three-player counterexample. Existence and uniqueness of Coco-S fixed points hold for all two-player games and three-player two-state games, and Coco-S admits an axiomatic characterization via a new Markov Consistency axiom that distinguishes it from HS-S.

What carries the argument

The statewise HS Bellman operator whose fixed points define Coco-S, together with the aggregation of dynamic coalition-versus-complement threat powers that defines HS-S.

If this is right

HS-S and Coco-S produce the same values, policies, and side payments in every two-player stochastic game.
There exist three-player stochastic games in which HS-S and Coco-S assign different values to the players.
Coco-S fixed points exist and are unique for every two-player stochastic game and every three-player game with exactly two states.
Sampling-based estimators approximate both values with explicit finite-sample error bounds.
The Markov Consistency axiom supplies an alternative axiomatic foundation for Coco-S that does not hold for HS-S.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

For games with more than two players the choice between HS-S and Coco-S can produce different recommended side-payment schedules that sustain cooperation.
The topological-degree argument used to prove uniqueness of fixed points may extend to three-player games with more than two states under additional continuity or contraction assumptions.
These value notions could be used to shape reward shaping or transfer mechanisms in multi-agent reinforcement learning environments that allow utility transfers across time.
The distinction between the two concepts highlights a potential trade-off between axiomatic uniqueness and recursive consistency when designing solution concepts for dynamic games.

Load-bearing premise

The Harsanyi-Shapley axioms from normal-form games extend to stochastic games while preserving uniqueness, and the statewise HS Bellman operator admits fixed points whose properties distinguish Coco-S from HS-S.

What would settle it

Explicit computation of HS-S and Coco-S values on the three-player counterexample game to check whether the two differ as claimed, or verification that only HS-S satisfies every listed extended axiom.

Figures

Figures reproduced from arXiv: 2303.05307 by Alan Kuhnle, Darleen Perez-Lavin, Jeffrey Richley, Jessica Singh Syal, Solmaz Kia, Yixin Chen.

**Figure 1.** Figure 1: The banana game. Each entry of the matrix is an ordered pair of the form (u1, u2), which indicates that the row player gets utility u1 and the column player gets utility u2 for the corresponding action selection. In this work, the problem of how to compute the strategic value of a player in transferable-utility (TU), general-sum stochastic games is considered, and how to use the strategic strength of a pl… view at source ↗

**Figure 2.** Figure 2: (Right) Illustration of the decomposition of an n-player game G into 2 n − 1 coalitional games vs. (Left) the Coco decomposition of Kalai and Kalai (2013) for 2-player games. where N is an operator that determines how the agents will pick their joint action a. The Minimax-Q, Nash-Q, CorrelatedN Q, and Coco-Q algorithms referenced above set = maxmin, NASH, CE, and HS, respectively. Of these, only Minimax-… view at source ↗

**Figure 3.** Figure 3: The strategic value for each player in each game, as computed by the corresponding algorithm. environment is known. Source code to reproduce the results is available in the supplementary material. Summary of results. We find that both HS and HS∗ learn sensible assessments of the strategic strength of each player and enable maximizing the overall combined score while preserving the competitive nature of the… view at source ↗

**Figure 4.** Figure 4: ), Players A, B, and C each have their own goals they need to reach without colliding by coordinating their moves across the grid. Notice that Players A and C are symmetric, but player B is closer to her goal than the other players. Therefore, one would expect that the strategic value State HS HS* 1 (-0.1, -0.2, 0.3) (-7.6, 15.2, -7.6) 2 (0.2, -0.3, 0.1) (-5.0, -7.9, 12.9) 3 (-16.3, 32.1, -15.8) (-30.4, 52… view at source ↗

**Figure 5.** Figure 5: , has a weak player (Player D) who starts two steps away from a shared goal with reward 100. The other players start one step from the shared goal, but have their own goals worth 1000 three steps away. If the other players try to move to their high-value goals, the weaker player can act as a spoiler (which is also in her self interest) by moving to the shared goal and ending the game. Both HS and HS∗ deter… view at source ↗

**Figure 6.** Figure 6: Prisoners. Top: A trajectory of HS, HS∗ (Left) and Correlated-Q (Right). Bottom: Side payments for the trajectory in top left [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Multiple trajectories generated by the HS and HS∗ policy in the Turkey Game. The first (Top) shows Player A attempting the semi-wall, the second (Bottom-Left) shows Player C moving through the semi-wall. The final trajectory (Bottom-Right), shows Player B moving out of the way of Player A for a guaranteed movement [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

read the original abstract

We study general-sum, multi-player stochastic games with transferable utility, motivated by settings where agents can use side payments to make cooperation individually rational. Building on the Harsanyi--Shapley (HS) value for normal-form games, we introduce two HS-based value notions for stochastic games: HS-S, defined by aggregating dynamic coalition-versus-complement threat powers, and Coco-S, defined as fixed points of a statewise HS Bellman operator. We extend HS-style axioms to the stochastic setting and show that HS-S is the unique mapping satisfying them. We prove that HS-S and Coco-S coincide in all two-player stochastic games, but can disagree when $n>2$, via an explicit three-player counterexample. We prove existence and uniqueness of Coco-S fixed points for all two-player games and for three-player two-state games via topological degree theory, and provide an axiomatic characterization of Coco-S through a new \emph{Markov Consistency} axiom that distinguishes it from HS-S. Finally, we give sampling-based estimators with finite-sample guarantees and empirically compare the induced values, policies, and side payments on multi-player grid-game benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HS-S and Coco-S coincide for two players but the three-player disagreement claim may collapse if the counterexample has more than two states.

read the letter

HS-S and Coco-S are the two new value notions here, and they match in two-player stochastic games but the paper claims they diverge for three players via an explicit counterexample. The work extends the Harsanyi-Shapley axioms to stochastic games and shows uniqueness for HS-S. It defines Coco-S through fixed points of the statewise HS Bellman operator and introduces Markov Consistency to distinguish it. Coincidence holds for all two-player cases. Existence and uniqueness of the fixed points are proven for all two-player games and three-player two-state games using topological degree theory. The sampling estimators come with finite-sample guarantees, and the grid-game experiments compare the values, policies, and payments in practice. The potential issue is with the counterexample. Since Coco-S existence is only shown for three-player games with two states, the disagreement result depends on the example having at most two states. The abstract leaves the number of states unspecified, so if it has more the fixed point may not exist and the claim falls apart. This is a concrete soft spot that should be checked against the full text. The rest of the development looks non-circular and builds on standard tools. Readers in cooperative game theory or multi-agent systems with transferable utility will get the most from this. It is worth a serious referee because it brings new concepts, formal characterizations, and some empirical checks to the table, even with the existence caveat. I recommend sending it for peer review after confirming the counterexample details.

Referee Report

1 major / 2 minor

Summary. The paper introduces HS-S (aggregating dynamic coalition threat powers) and Coco-S (fixed points of a statewise HS Bellman operator) as value notions for general-sum multi-player stochastic games with transferable utility. It extends Harsanyi-Shapley axioms to prove HS-S is the unique mapping satisfying them in the stochastic setting, shows HS-S and Coco-S coincide in all two-player stochastic games, provides an explicit three-player counterexample where they disagree for n>2, proves existence and uniqueness of Coco-S fixed points for all two-player games and three-player two-state games via topological degree theory, gives an axiomatic characterization of Coco-S via a new Markov Consistency axiom, and supplies sampling-based estimators with finite-sample guarantees, with empirical comparisons on multi-player grid games.

Significance. If the central claims hold, the work meaningfully extends cooperative solution concepts to stochastic games, supplying both axiomatic foundations that distinguish the two notions and practical estimators. The explicit counterexample, topological-degree existence results, and finite-sample guarantees are concrete strengths that could support further theoretical and algorithmic work on side-payment-enabled cooperation in multi-agent systems.

major comments (1)

[Section 4 (counterexample) and Section 3 (existence results)] The disagreement claim between HS-S and Coco-S for n>2 rests on the explicit three-player counterexample. However, existence and uniqueness of Coco-S fixed points (via the statewise HS Bellman operator and topological degree arguments) is established only for all two-player games and three-player two-state games. The manuscript must specify the number of states in this counterexample; if it exceeds two states, Coco-S may fail to be well-defined there, so the comparison cannot be asserted.

minor comments (2)

[Section 3] Notation for the statewise HS Bellman operator and the Markov Consistency axiom could be introduced with an explicit equation reference on first use to improve readability.
[Section 6] The finite-sample guarantees for the estimators are stated but the dependence on the number of players and states should be made fully explicit in the theorem statement.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive comment on the relationship between the counterexample and the existence results. We respond to the major comment below.

read point-by-point responses

Referee: [Section 4 (counterexample) and Section 3 (existence results)] The disagreement claim between HS-S and Coco-S for n>2 rests on the explicit three-player counterexample. However, existence and uniqueness of Coco-S fixed points (via the statewise HS Bellman operator and topological degree arguments) is established only for all two-player games and three-player two-state games. The manuscript must specify the number of states in this counterexample; if it exceeds two states, Coco-S may fail to be well-defined there, so the comparison cannot be asserted.

Authors: We appreciate the referee highlighting this important clarification. The explicit three-player counterexample presented in Section 4 is constructed on a game with exactly two states. This places it squarely within the class of three-player two-state games for which existence and uniqueness of the Coco-S fixed point is established via topological degree theory in Section 3. We will revise the manuscript to explicitly state the number of states in the counterexample (both in the main text and in the caption of the relevant figure/table) to remove any ambiguity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; axiomatic and topological arguments are independent of inputs.

full rationale

The paper defines HS-S by aggregating dynamic coalition threat powers and Coco-S as fixed points of the statewise HS Bellman operator. It extends HS axioms to prove uniqueness of HS-S and introduces a new Markov Consistency axiom to characterize Coco-S. Existence/uniqueness of Coco-S fixed points is established via topological degree theory for the stated game classes, with an explicit counterexample for disagreement when n>2. No derivation step reduces by construction to a fitted parameter, self-citation chain, or self-definitional equivalence. The central claims rest on the stated axioms, operator definitions, and external mathematical results rather than circular reductions. The potential gap noted in the skeptic headline concerns proof coverage for the counterexample but does not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on extending HS axioms to stochastic games and introducing a new Markov Consistency axiom; no free parameters or invented physical entities are described.

axioms (2)

domain assumption HS-style axioms can be extended to stochastic games while preserving uniqueness of the value mapping
Invoked when claiming HS-S is the unique mapping satisfying the extended axioms
domain assumption The statewise HS Bellman operator admits fixed points that satisfy Markov Consistency
Used to define Coco-S and distinguish it from HS-S

invented entities (2)

HS-S value no independent evidence
purpose: Aggregating dynamic coalition-versus-complement threat powers across states
New definition introduced for stochastic games
Coco-S value no independent evidence
purpose: Fixed point of the statewise HS Bellman operator
New definition introduced for stochastic games

pith-pipeline@v0.9.0 · 5747 in / 1638 out tokens · 28036 ms · 2026-05-24T09:28:42.325423+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HS-S defined by aggregating dynamic coalition-versus-complement threat powers; Coco-S as fixed points of statewise HS Bellman operator; Markov Consistency axiom
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Existence/uniqueness via topological degree for two-player and three-player two-state games

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[1]

Csaba Szepesv \'a ri and Michael L. Littman. Generalized markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report, 0 (November): 0 1--54, 1996

work page 1996
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[3]

Deep Q-Learning for Nash Equilibria : Nash-DQN

Philippe Casgrain, Brian Ning, and Sebastian Jaimungal. Deep Q-Learning for Nash Equilibria : Nash-DQN . Applied Mathematical Finance, 29 0 (1): 0 62--78, January 2022

work page 2022
[4]

Goldberg, and Christos H

Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The Complexity of Computing a Nash Equilibrium . SIAM Journal on Computing, 39 0 (1): 0 195--259, January 2009

work page 2009
[5]

On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games , September 2021

Xiaotie Deng, Yuhao Li, David Henry Mguni, Jun Wang, and Yaodong Yang. On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games , September 2021

work page 2021
[6]

Correlated Q-learning

Amy Greenwald, Keith Hall, and Roberto Serrano. Correlated Q-learning . In ICML , volume 3, pages 242--249, 2003

work page 2003
[7]

Multi-agent deep reinforcement learning: A survey

Sven Gronauer and Klaus Diepold. Multi-agent deep reinforcement learning: A survey. Artificial Intelligence Review, 55 0 (2): 0 895--943, 2022

work page 2022
[8]

Harsanyi

John C. Harsanyi. A Simplified Bargaining Model for the n- Person Cooperative Game . International Economic Review, 4 0 (2): 0 194--220, 1963

work page 1963
[9]

Pablo Hernandez-Leal , Bilal Kartal, and Matthew E. Taylor. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33 0 (6): 0 750--797, November 2019

work page 2019
[10]

Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of machine learning research, 4 0 (Nov): 0 1039--1069, 2003

work page 2003
[11]

Anthony, Tom Eccles, Joel Z

Edward Hughes, Thomas W. Anthony, Tom Eccles, Joel Z. Leibo, David Balduzzi, and Yoram Bachrach. Learning to Resolve Alliance Dilemmas in Many-Player Zero-Sum Games , February 2020

work page 2020
[12]

Cooperation in Strategic Games Revisited *

Adam Kalai and Ehud Kalai. Cooperation in Strategic Games Revisited *. The Quarterly Journal of Economics, 128 0 (2): 0 917--966, May 2013

work page 2013
[13]

Kalai and R

E. Kalai and R. W. Rosenthal. Arbitration of two-party disputes under ignorance. International Journal of Game Theory, 7 0 (2): 0 65--72, June 1978

work page 1978
[14]

Cooperative strategic games

Elon Kohlberg and Abraham Neyman. Cooperative strategic games. Theoretical Economics, 16 0 (3): 0 825--851, 2021

work page 2021
[15]

Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In William W. Cohen and Haym Hirsh, editors, Machine Learning Proceedings 1994 , pages 157--163. Morgan Kaufmann , San Francisco (CA) , January 1994

work page 1994
[16]

Michael L. Littman. Friend-or-foe Q-learning in general-sum games. In ICML , volume 1, pages 322--328, 2001

work page 2001
[17]

Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games

Weichao Mao and Tamer Ba s ar. Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games . Dynamic Games and Applications, January 2022

work page 2022
[18]

On improving model-free algorithms for decentralized multi-agent reinforcement learning

Weichao Mao, Lin Yang, Kaiqing Zhang, and Tamer Basar. On improving model-free algorithms for decentralized multi-agent reinforcement learning. In International Conference on Machine Learning , pages 15007--15049. PMLR , 2022

work page 2022
[19]

Two- Person Cooperative Games

John Nash. Two- Person Cooperative Games . Econometrica, 21 0 (1): 0 128--140, 1953

work page 1953
[20]

If multi-agent learning is the answer, what is the question? Artificial intelligence, 171 0 (7): 0 365--377, 2007

Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial intelligence, 171 0 (7): 0 365--377, 2007

work page 2007
[21]

Coco- Q : Learning in Stochastic Games with Side Payments

Eric Sodomka, Elizabeth Hilliard, Michael Littman, and Amy Greenwald. Coco- Q : Learning in Stochastic Games with Side Payments . In Proceedings of the 30th International Conference on Machine Learning , pages 1471--1479. PMLR , May 2013

work page 2013
[22]

When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently ? arXiv preprint arXiv:2110.04184, 2021

Ziang Song, Song Mei, and Yu Bai. When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently ? arXiv preprint arXiv:2110.04184, 2021

work page arXiv 2021
[23]

A Friend-or-Foe framework for multi-agent reinforcement learning policy generation in mixing cooperative competitive scenarios

Yu Sun, Jun Lai, Lei Cao, Xiliang Chen, Zhixiong Xu, Zhen Lian, and Huijin Fan. A Friend-or-Foe framework for multi-agent reinforcement learning policy generation in mixing cooperative competitive scenarios. Transactions of the Institute of Measurement and Control, page 01423312221077755, 2022

work page 2022
[24]

Online learning in unknown markov games

Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Online learning in unknown markov games. In International Conference on Machine Learning, pages 10279--10288. PMLR , 2021

work page 2021
[25]

Von Neumann and O

J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Theory of Games and Economic Behavior. Princeton University Press , Princeton, NJ, US , 1944

work page 1944
[26]

Wang, Edward Hughes, Chrisantha Fernando, Wojciech M

Jane X. Wang, Edward Hughes, Chrisantha Fernando, Wojciech M. Czarnecki, Edgar A. Duenez-Guzman , and Joel Z. Leibo. Evolving intrinsic motivations for altruistic behavior, March 2019

work page 2019
[27]

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8 0 (3): 0 279--292, May 1992

work page 1992
[28]

Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium

Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on Learning Theory, pages 3674--3682. PMLR , 2020

work page 2020
[29]

Learning to Incentivize Other Learning Agents

Jiachen Yang, Ang Li, Mehrdad Farajtabar, Peter Sunehag, Edward Hughes, and Hongyuan Zha. Learning to Incentivize Other Learning Agents . In Advances in Neural Information Processing Systems , volume 33, pages 15208--15219. Curran Associates, Inc. , 2020

work page 2020
[30]

The Surprising Effectiveness of PPO in Cooperative , Multi-Agent Games , November 2022

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The Surprising Effectiveness of PPO in Cooperative , Multi-Agent Games , November 2022

work page 2022
[31]

Model- Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Kaiqing Zhang, Sham Kakade, Tamer Basar, and Lin Yang. Model- Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity . In Advances in Neural Information Processing Systems , volume 33, pages 1166--1178. Curran Associates, Inc. , 2020

work page 2020
[32]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[1] [1]

Csaba Szepesv \'a ri and Michael L. Littman. Generalized markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical Report, 0 (November): 0 1--54, 1996

work page 1996

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[3] [3]

Deep Q-Learning for Nash Equilibria : Nash-DQN

Philippe Casgrain, Brian Ning, and Sebastian Jaimungal. Deep Q-Learning for Nash Equilibria : Nash-DQN . Applied Mathematical Finance, 29 0 (1): 0 62--78, January 2022

work page 2022

[4] [4]

Goldberg, and Christos H

Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The Complexity of Computing a Nash Equilibrium . SIAM Journal on Computing, 39 0 (1): 0 195--259, January 2009

work page 2009

[5] [5]

On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games , September 2021

Xiaotie Deng, Yuhao Li, David Henry Mguni, Jun Wang, and Yaodong Yang. On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games , September 2021

work page 2021

[6] [6]

Correlated Q-learning

Amy Greenwald, Keith Hall, and Roberto Serrano. Correlated Q-learning . In ICML , volume 3, pages 242--249, 2003

work page 2003

[7] [7]

Multi-agent deep reinforcement learning: A survey

Sven Gronauer and Klaus Diepold. Multi-agent deep reinforcement learning: A survey. Artificial Intelligence Review, 55 0 (2): 0 895--943, 2022

work page 2022

[8] [8]

Harsanyi

John C. Harsanyi. A Simplified Bargaining Model for the n- Person Cooperative Game . International Economic Review, 4 0 (2): 0 194--220, 1963

work page 1963

[9] [9]

Pablo Hernandez-Leal , Bilal Kartal, and Matthew E. Taylor. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33 0 (6): 0 750--797, November 2019

work page 2019

[10] [10]

Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of machine learning research, 4 0 (Nov): 0 1039--1069, 2003

work page 2003

[11] [11]

Anthony, Tom Eccles, Joel Z

Edward Hughes, Thomas W. Anthony, Tom Eccles, Joel Z. Leibo, David Balduzzi, and Yoram Bachrach. Learning to Resolve Alliance Dilemmas in Many-Player Zero-Sum Games , February 2020

work page 2020

[12] [12]

Cooperation in Strategic Games Revisited *

Adam Kalai and Ehud Kalai. Cooperation in Strategic Games Revisited *. The Quarterly Journal of Economics, 128 0 (2): 0 917--966, May 2013

work page 2013

[13] [13]

Kalai and R

E. Kalai and R. W. Rosenthal. Arbitration of two-party disputes under ignorance. International Journal of Game Theory, 7 0 (2): 0 65--72, June 1978

work page 1978

[14] [14]

Cooperative strategic games

Elon Kohlberg and Abraham Neyman. Cooperative strategic games. Theoretical Economics, 16 0 (3): 0 825--851, 2021

work page 2021

[15] [15]

Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In William W. Cohen and Haym Hirsh, editors, Machine Learning Proceedings 1994 , pages 157--163. Morgan Kaufmann , San Francisco (CA) , January 1994

work page 1994

[16] [16]

Michael L. Littman. Friend-or-foe Q-learning in general-sum games. In ICML , volume 1, pages 322--328, 2001

work page 2001

[17] [17]

Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games

Weichao Mao and Tamer Ba s ar. Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games . Dynamic Games and Applications, January 2022

work page 2022

[18] [18]

On improving model-free algorithms for decentralized multi-agent reinforcement learning

Weichao Mao, Lin Yang, Kaiqing Zhang, and Tamer Basar. On improving model-free algorithms for decentralized multi-agent reinforcement learning. In International Conference on Machine Learning , pages 15007--15049. PMLR , 2022

work page 2022

[19] [19]

Two- Person Cooperative Games

John Nash. Two- Person Cooperative Games . Econometrica, 21 0 (1): 0 128--140, 1953

work page 1953

[20] [20]

If multi-agent learning is the answer, what is the question? Artificial intelligence, 171 0 (7): 0 365--377, 2007

Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, what is the question? Artificial intelligence, 171 0 (7): 0 365--377, 2007

work page 2007

[21] [21]

Coco- Q : Learning in Stochastic Games with Side Payments

Eric Sodomka, Elizabeth Hilliard, Michael Littman, and Amy Greenwald. Coco- Q : Learning in Stochastic Games with Side Payments . In Proceedings of the 30th International Conference on Machine Learning , pages 1471--1479. PMLR , May 2013

work page 2013

[22] [22]

When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently ? arXiv preprint arXiv:2110.04184, 2021

Ziang Song, Song Mei, and Yu Bai. When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently ? arXiv preprint arXiv:2110.04184, 2021

work page arXiv 2021

[23] [23]

A Friend-or-Foe framework for multi-agent reinforcement learning policy generation in mixing cooperative competitive scenarios

Yu Sun, Jun Lai, Lei Cao, Xiliang Chen, Zhixiong Xu, Zhen Lian, and Huijin Fan. A Friend-or-Foe framework for multi-agent reinforcement learning policy generation in mixing cooperative competitive scenarios. Transactions of the Institute of Measurement and Control, page 01423312221077755, 2022

work page 2022

[24] [24]

Online learning in unknown markov games

Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Online learning in unknown markov games. In International Conference on Machine Learning, pages 10279--10288. PMLR , 2021

work page 2021

[25] [25]

Von Neumann and O

J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Theory of Games and Economic Behavior. Princeton University Press , Princeton, NJ, US , 1944

work page 1944

[26] [26]

Wang, Edward Hughes, Chrisantha Fernando, Wojciech M

Jane X. Wang, Edward Hughes, Chrisantha Fernando, Wojciech M. Czarnecki, Edgar A. Duenez-Guzman , and Joel Z. Leibo. Evolving intrinsic motivations for altruistic behavior, March 2019

work page 2019

[27] [27]

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8 0 (3): 0 279--292, May 1992

work page 1992

[28] [28]

Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium

Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on Learning Theory, pages 3674--3682. PMLR , 2020

work page 2020

[29] [29]

Learning to Incentivize Other Learning Agents

Jiachen Yang, Ang Li, Mehrdad Farajtabar, Peter Sunehag, Edward Hughes, and Hongyuan Zha. Learning to Incentivize Other Learning Agents . In Advances in Neural Information Processing Systems , volume 33, pages 15208--15219. Curran Associates, Inc. , 2020

work page 2020

[30] [30]

The Surprising Effectiveness of PPO in Cooperative , Multi-Agent Games , November 2022

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The Surprising Effectiveness of PPO in Cooperative , Multi-Agent Games , November 2022

work page 2022

[31] [31]

Model- Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Kaiqing Zhang, Sham Kakade, Tamer Basar, and Lin Yang. Model- Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity . In Advances in Neural Information Processing Systems , volume 33, pages 1166--1178. Curran Associates, Inc. , 2020

work page 2020

[32] [32]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page