The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection

Benedict Russell; Chin-wing Leung; Paolo Turrini

arxiv: 2605.18185 · v1 · pith:NIAGMWQYnew · submitted 2026-05-18 · 💻 cs.MA

The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection

Benedict Russell , Chin-wing Leung , Paolo Turrini This is my paper

Pith reviewed 2026-05-20 00:18 UTC · model grok-4.3

classification 💻 cs.MA

keywords policy gradientsocial dilemmaspartner selectioncooperation emergencemulti-agent learningopponent distributionWiener processstationary distribution

0 comments

The pith

Partner selection changes opponent distributions to promote cooperation in policy-gradient social dilemmas when population variance is present.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops an analytical model for how partner selection influences the learning dynamics of self-interested agents playing social dilemmas. It demonstrates that selection alters the distribution of encountered opponents, which in turn reshapes the reward landscape in a way that favors cooperation according to established rules. The work identifies population variance as a necessary condition for cooperation to arise and uses a two-dimensional Wiener process to incorporate stochastic effects from random partner encounters. Simulations validate that the resulting model matches observed policy-gradient behavior and shows how learning rates influence whether cooperation stabilizes.

Core claim

Partner selection modifies the opponent distribution and thereby the reward landscape faced by policy-gradient learners, which promotes cooperation under simple rules from the literature. Population variance is a necessary condition for cooperation to emerge. A two-dimensional Wiener process captures the stochastic effects of partner selection, yielding a sufficient condition for the population to be cooperation-promoting and proving the existence of a stationary distribution.

What carries the argument

The shift in opponent distribution induced by partner selection, modeled through a two-dimensional Wiener process to represent stochastic encounters.

If this is right

Cooperation emerges reliably in populations that maintain variance under partner selection.
The stochastic model accurately reproduces the full policy-gradient dynamics observed in simulations.
The learning rate controls the speed and stability of the transition to cooperation.
A derived sufficient condition identifies which populations will be cooperation-promoting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distribution-shift mechanism might be tested in other multi-agent learning algorithms to check generality beyond policy gradients.
Engineering environments with controlled variance could be explored as a design lever for encouraging cooperation in applied settings.
The stationary distribution result suggests long-run statistical predictions for agent behavior that could be checked against empirical multi-agent data.

Load-bearing premise

Partner selection effects can be fully captured by shifts in opponent distribution, and a two-dimensional Wiener process adequately models the stochastic encounters so that prior simple rules apply directly.

What would settle it

A controlled simulation in which population variance is set to zero yet cooperation still emerges and persists under partner selection and policy-gradient updates would contradict the necessity claim.

Figures

Figures reproduced from arXiv: 2605.18185 by Benedict Russell, Chin-wing Leung, Paolo Turrini.

**Figure 2.** Figure 2: Evolution of the strategy distribution under OFT where the population is initialised with [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Evolution of the strategy distribution where the population is initialised with [PITH_FULL_IMAGE:figures/full_fig_p032_3.png] view at source ↗

**Figure 4.** Figure 4: Evolution of the strategy distribution where the population is initialised with [PITH_FULL_IMAGE:figures/full_fig_p032_4.png] view at source ↗

read the original abstract

In social dilemmas self-interested learning agents face the choice between the societal benefit of cooperation and the immediate reward of defection. Significant evidence exists on the benefits of assortment mechanisms such as partner selection for the emergence of cooperation, but this is largely available through agent-based simulations. In this paper, we provide an analytical solution to the problem, studying the policy-gradient dynamics in a multi-agent environment with partner selection. We show how partner selection changes the opponent distribution and hence the reward landscape, and prove this promotes cooperation under simple rules known from the literature. In particular, we find that population variance is a necessary condition for cooperation to emerge. Using a two-dimensional Wiener process, we extend the dynamics to capture the stochastic effects of partner selection and the resulting opponent distribution. We derive a sufficient condition for the population to be cooperation-promoting and prove the existence of a stationary distribution. Simulations confirm that the stochastic model accurately captures the policy-gradient dynamics and clarifies how the learning rate affects the emergence of cooperation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes policy-gradient dynamics in multi-agent social dilemmas with partner selection. It claims that partner selection alters the opponent distribution and reward landscape to promote cooperation under known rules from the literature, with population variance as a necessary condition for cooperation to emerge. The deterministic dynamics are extended via a two-dimensional Wiener process to model stochastic opponent encounters, yielding a sufficient condition for a cooperation-promoting population, a proof of stationary distribution existence, and simulation validation that the stochastic model captures the dynamics and learning-rate effects on cooperation.

Significance. If the derivations and proofs hold, the work would provide a valuable analytical bridge between simulation-based evidence on assortment mechanisms and policy-gradient learning in social dilemmas. It would establish variance as a necessary condition and offer diffusion-based conditions for stationary cooperative outcomes, strengthening theoretical understanding in multi-agent RL.

major comments (2)

[§4] §4 (stochastic extension via 2D Wiener process): The derivation of the sufficient condition for a cooperation-promoting population and the proof of stationary distribution existence both rest on this diffusion approximation for partner selection. The approximation assumes independent stochastic encounters that may fail to preserve discrete matching correlations or finite-population effects inherent in actual partner selection; if so, the necessity of population variance for cooperation does not transfer to the original multi-agent system.
[Abstract and §3–4] The claim that population variance is a necessary condition (stated in the abstract and derived from opponent-distribution shifts): This is load-bearing for the central result, yet its validity depends on the Wiener process accurately reproducing the higher-order statistics of partner selection; without explicit verification against the discrete matching process (e.g., via comparison of moments or simulation of finite-N effects), the necessity result remains conditional on the approximation.

minor comments (2)

[Abstract] The abstract refers to 'simple rules known from the literature' without naming them; these should be explicitly cited in the introduction or model section for clarity.
[§4] Notation for the two-dimensional Wiener process and its drift/diffusion terms could be introduced earlier with a clear link to the deterministic policy-gradient equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the scope of our analytical results. We address each major comment below, clarifying the separation between our deterministic analysis and the stochastic extension while committing to revisions that strengthen the validation of the approximation.

read point-by-point responses

Referee: [§4] §4 (stochastic extension via 2D Wiener process): The derivation of the sufficient condition for a cooperation-promoting population and the proof of stationary distribution existence both rest on this diffusion approximation for partner selection. The approximation assumes independent stochastic encounters that may fail to preserve discrete matching correlations or finite-population effects inherent in actual partner selection; if so, the necessity of population variance for cooperation does not transfer to the original multi-agent system.

Authors: We appreciate the referee highlighting the limitations of the diffusion approximation. We clarify that the necessity of population variance for cooperation emergence is derived analytically from the deterministic policy-gradient dynamics under partner selection (Section 3), based on the induced shifts in opponent distribution; this result is established independently of the stochastic model. The two-dimensional Wiener process in Section 4 is introduced afterward specifically to obtain a sufficient condition for cooperation-promoting populations and to prove existence of a stationary distribution. We agree that the approximation idealizes encounters as independent and may not capture all higher-order correlations or finite-population effects present in discrete partner selection. Accordingly, we will revise the manuscript to expand the discussion of the diffusion approximation's assumptions, its relation to the discrete process, and to include additional simulations examining finite-N effects and correlation preservation. revision: partial
Referee: [Abstract and §3–4] The claim that population variance is a necessary condition (stated in the abstract and derived from opponent-distribution shifts): This is load-bearing for the central result, yet its validity depends on the Wiener process accurately reproducing the higher-order statistics of partner selection; without explicit verification against the discrete matching process (e.g., via comparison of moments or simulation of finite-N effects), the necessity result remains conditional on the approximation.

Authors: We note that the necessity claim is obtained from the deterministic analysis of opponent-distribution shifts due to partner selection (Section 3 and appendix proofs) and does not rely on the Wiener process, which is used only for the subsequent stochastic extension and sufficient-condition derivation. The manuscript already reports simulations showing that the stochastic model captures the overall policy-gradient dynamics and learning-rate effects. To directly respond to the concern about higher-order statistics, we will add explicit moment comparisons between the discrete partner-selection process and the diffusion approximation, together with finite-population simulations, thereby providing the requested verification and removing any conditionality on the approximation for the necessity result. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained analytical modeling

full rationale

The paper constructs an explicit stochastic model via a two-dimensional Wiener process to approximate partner selection effects on opponent distributions, then derives a sufficient condition for cooperation promotion and proves existence of a stationary distribution from the resulting Fokker-Planck or Kolmogorov forward equations. These steps are forward derivations from the stated diffusion approximation and the imported simple rules from the literature; they do not reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. The necessity of population variance follows from the variance term in the derived drift or diffusion coefficients rather than being presupposed. No quoted equation equates a claimed prediction directly to an input fit or prior self-result. The analysis therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the model appears to rest on standard policy-gradient assumptions and a Wiener-process approximation whose details are not visible.

pith-pipeline@v0.9.0 · 5699 in / 1186 out tokens · 50030 ms · 2026-05-20T00:18:10.978708+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

population variance is a necessary condition for cooperation to emerge

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 2 internal anchors

[1]

Anastassacos, S

N. Anastassacos, S. Hailes, and M. Musolesi. Partner Selection for the Emergence of Cooperation in Multi-Agent Systems Using Reinforcement Learning, Feb. 2020. URL https://aaai.org/Library/conferences-library.php. Conference Name: Thirty- Fourth AAAI Conference on Artificial Intelligence (AAAI-20) Meeting Name: Thirty-Fourth AAAI Conference on Artificial ...

work page 2020
[2]

J. Bara, P. Turrini, and G. Andrighetto. Enabling imitation-based cooperation in dy- namic social networks.Autonomous Agents and Multi-Agent Systems, 36(2):34, May 2022. ISSN 1573-7454. doi:10.1007/s10458-022-09562-w. URL https://doi.org/10.1007/ s10458-022-09562-w

work page doi:10.1007/s10458-022-09562-w 2022
[3]

Bernasconi, F

M. Bernasconi, F. Cacciamani, S. Fioravanti, N. Gatti, and F. Trovò. The evolutionary dynamics of soft-max policy gradient in multi-agent settings.Theoretical Computer Science, 1027: 115011, Feb. 2025. ISSN 0304-3975. doi:10.1016/j.tcs.2024.115011. URL https://www. sciencedirect.com/science/article/pii/S0304397524006285

work page doi:10.1016/j.tcs.2024.115011 2025
[4]

Billingsley.Convergence of probability measures

P. Billingsley.Convergence of probability measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., second edition, 1999. ISBN 0-471-19745-9. A Wiley-Interscience Publication

work page 1999
[5]

Journal of Artificial Intelligence Research53, 659–697 (2015) https://doi.org/10.1613/jair.4818 15

D. Bloembergen, K. Tuyls, D. Hennes, and M. Kaisers. Evolutionary Dynamics of Multi-Agent Learning: A Survey.Journal of Artificial Intelligence Research, 53:659–697, Aug. 2015. doi:10.1613/jair.4818

work page doi:10.1613/jair.4818 2015
[6]

Bonnet and F

B. Bonnet and F. Rossi. The pontryagin maximum principle in the wasserstein space.Calculus of Variations and Partial Differential Equations, 58(1):11, Dec. 2018. ISSN 0944-2669, 1432-0835. doi:10.1007/s00526-018-1447-2. URLhttp://arxiv.org/abs/1711.07667

work page doi:10.1007/s00526-018-1447-2 2018
[7]

C. Chu, Y . Li, J. Liu, S. Hu, X. Li, and Z. Wang. A Formal Model for Multiagent Q- Learning Dynamics on Regular Graphs. InProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 194–200, Vienna, Austria, July 2022. Inter- national Joint Conferences on Artificial Intelligence Organization. ISBN 978-1-956792-00-3. d...

work page doi:10.24963/ijcai.2022/28 2022
[8]

Fan, C.-w

X. Fan, C.-w. Leung, and P. Turrini. Co-learning of strategy and structure achieves full cooperation in complex networks with dynamical linking. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 72–80. International Joint Conferences on Artificial Intelligence Organization, Sep. 2025. doi:10.24963/ijcai.2025/9

work page doi:10.24963/ijcai.2025/9 2025
[9]

K. Fehl, D. J. van der Post, and D. Semmann. Co-evolution of behaviour and social network structure promotes human cooperation.Ecology Letters, 14(6):546–551, June 2011. ISSN 1461-0248. doi:10.1111/j.1461-0248.2011.01615.x

work page doi:10.1111/j.1461-0248.2011.01615.x 2011
[10]

F. Fu, T. Wu, and L. Wang. Partner switching stabilizes cooperation in coevolutionary pris- oner’s dilemma.Physical Review E, 79(3):036101, Mar. 2009. ISSN 1539-3755, 1550-

work page 2009
[11]

URL https://link.aps.org/doi/10.1103/ PhysRevE.79.036101

doi:10.1103/PhysRevE.79.036101. URL https://link.aps.org/doi/10.1103/ PhysRevE.79.036101

work page doi:10.1103/physreve.79.036101
[12]

Fudenberg and C

D. Fudenberg and C. Harris. Evolutionary dynamics with aggregate shocks.Journal of Economic Theory, 57(2):420–441, Aug. 1992. ISSN 00220531. doi:10.1016/0022-0531(92)90044-I. URL https://linkinghub.elsevier.com/retrieve/pii/002205319290044I

work page doi:10.1016/0022-0531(92)90044-i 1992
[13]

Galstyan

A. Galstyan. Continuous strategy replicator dynamics for multi-agent Q-learning.Autonomous Agents and Multi-Agent Systems, 26(1):37–53, Jan. 2013. ISSN 1573-7454. doi:10.1007/s10458- 011-9181-6. URLhttps://doi.org/10.1007/s10458-011-9181-6

work page doi:10.1007/s10458- 2013
[14]

S. Hu, C. wing Leung, and H. fung Leung. Modelling the dynamics of multiagent q-learning in repeated symmetric games: a mean field theoretic approach. InNeural Information Processing Systems, 2019. URLhttps://api.semanticscholar.org/CorpusID:202768371. 11

work page 2019
[15]

Hu, C.-w

S. Hu, C.-w. Leung, H.-f. Leung, and H. Soh. The dynamics of q-learning in population games: A physics-inspired continuity equation model.arXiv preprint arXiv:2203.01500, 2022

work page arXiv 2022
[16]

K. Itô. Stochastic integral.Proceedings of the Imperial Academy, 20(8): 519–524, Jan. 1944. ISSN 0369-9846. doi:10.3792/pia/1195572786. URL https://projecteuclid.org/journals/proceedings-of-the-imperial-academy/ volume-20/issue-8/Stochastic-integral/10.3792/pia/1195572786.full

work page doi:10.3792/pia/1195572786 1944
[17]

L. R. Izquierdo, S. S. Izquierdo, and F. Vega-Redondo. Leave and let leave: A sufficient condition to explain the evolutionary emergence of cooperation.Journal of Economic Dynamics and Control, 46:91–113, Sept. 2014. ISSN 0165-1889. doi:10.1016/j.jedc.2014.06.007. URL https://www.sciencedirect.com/science/article/pii/S0165188914001456

work page doi:10.1016/j.jedc.2014.06.007 2014
[18]

S. S. Izquierdo, L. R. Izquierdo, and F. Vega-Redondo. The option to leave: Conditional dissociation in the evolution of cooperation.Journal of Theoretical Biology, 267(1):76–84, Nov

work page
[19]

doi:10.1016/j.jtbi.2010.07.039

ISSN 0022-5193. doi:10.1016/j.jtbi.2010.07.039. URL https://www.sciencedirect. com/science/article/pii/S0022519310003966

work page doi:10.1016/j.jtbi.2010.07.039 2010
[20]

Y . Kifer. Random Perturbations of Dynamical Systems, 1988. URLhttps://link.springer. com/book/10.1007/978-1-4615-8181-9

work page doi:10.1007/978-1-4615-8181-9 1988
[21]

Leung and P

C.-w. Leung and P. Turrini. Learning partner selection rules that sustain cooperation in social dilemmas with the option of opting out. InProceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), 2024

work page 2024
[22]

Leung, S

C.-w. Leung, S. Hu, and H.-f. Leung. Modelling the Dynamics of Multi-Agent Q-learning: The Stochastic Effects of Local Interaction and Incomplete Information. InProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 384–390, Vienna, Austria, July 2022. International Joint Conferences on Artificial Intelligence Org...

work page doi:10.24963/ijcai.2022/55 2022
[23]

Leung, S

C.-w. Leung, S. Hu, and H.-f. Leung. The stochastic evolutionary dynamics of softmax policy gradient in games. InProceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), 2024

work page 2024
[24]

Leung, P

C.-w. Leung, P. Turrini, and A. Nowé. Curiosity-driven partner selection accelerates convention emergence in language games. InProc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), 2025

work page 2025
[25]

Leung, P

C.-w. Leung, P. Turrini, F. P. Santos, and M. Musolesi. Learning to cooperate with minimal observability. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29521–29529, 2026

work page 2026
[26]

Z. Li, Z. Yang, T. Wu, and L. Wang. Aspiration-Based Partner Switching Boosts Co- operation in Social Dilemmas.PLOS ONE, 9(6):e97866, June 2014. ISSN 1932-

work page 2014
[27]

URL https://journals.plos.org/plosone/ article?id=10.1371/journal.pone.0097866

doi:10.1371/journal.pone.0097866. URL https://journals.plos.org/plosone/ article?id=10.1371/journal.pone.0097866

work page doi:10.1371/journal.pone.0097866
[28]

The emergence of rational behavior in the presence of stochastic perturbations

P. Mertikopoulos and A. L. Moustakas. The emergence of rational behavior in the presence of stochastic perturbations.The Annals of Applied Probability, 20(4), Aug. 2010. ISSN 1050-5164. doi:10.1214/09-AAP651. URL http://arxiv.org/abs/0906.2094. arXiv:0906.2094 [math]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1214/09-aap651 2010
[29]

J. Nash. Non-Cooperative Games.Annals of Mathematics, 54(2):286–295, 1951. ISSN 0003-486X. doi:10.2307/1969529. URLhttps://www.jstor.org/stable/1969529

work page doi:10.2307/1969529 1951
[30]

Nguyen, H

D. Nguyen, H. Le, K. Do, S. Gupta, S. Venkatesh, and T. Tran. Navigating social dilemmas with llm-based agents via consideration of future consequences. In J. Kwok, editor,Proceed- ings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 223–231. International Joint Conferences on Artificial Intelligence Organiz...

work page doi:10.24963/ijcai.2025/26 2025
[31]

M. A. Nowak. Five rules for the evolution of cooperation.Science (New York, N.y.), 314 (5805):1560–1563, Dec. 2006. ISSN 0036-8075. doi:10.1126/science.1133755. URL https: //pmc.ncbi.nlm.nih.gov/articles/PMC3279745/

work page doi:10.1126/science.1133755 2006
[32]

J. M. Pacheco, A. Traulsen, and M. A. Nowak. Active linking in evolutionary games.Journal of Theoretical Biology, 243(3):437–443, Dec. 2006. ISSN 0022-

work page 2006
[33]

URL https://www.sciencedirect.com/science/ article/pii/S0022519306002736

doi:10.1016/j.jtbi.2006.06.027. URL https://www.sciencedirect.com/science/ article/pii/S0022519306002736

work page doi:10.1016/j.jtbi.2006.06.027 2006
[34]

J. M. Pacheco, A. Traulsen, and M. A. Nowak. Coevolution of Strategy and Structure in Complex Networks with Dynamical Linking.Physical Review Letters, 97(25):258103, Dec

work page
[35]

doi:10.1103/PhysRevLett.97.258103

ISSN 0031-9007, 1079-7114. doi:10.1103/PhysRevLett.97.258103. URL https: //link.aps.org/doi/10.1103/PhysRevLett.97.258103

work page doi:10.1103/physrevlett.97.258103
[36]

J. M. Pacheco, A. Traulsen, H. Ohtsuki, and M. A. Nowak. Repeated games and direct reciprocity under active linking.Journal of Theoretical Biology, 250(4):723–731, Feb. 2008. ISSN 00225193. doi:10.1016/j.jtbi.2007.10.040. URL https://linkinghub.elsevier. com/retrieve/pii/S0022519307005450

work page doi:10.1016/j.jtbi.2007.10.040 2008
[37]

Priklopil, K

T. Priklopil, K. Chatterjee, and M. Nowak. Optional interactions and suspicious behaviour facilitates trustful cooperation in prisoners dilemma.Journal of Theoretical Biology, 433: 64–72, Nov. 2017. ISSN 0022-5193. doi:10.1016/j.jtbi.2017.08.025. URL https://www. sciencedirect.com/science/article/pii/S0022519317303995

work page doi:10.1016/j.jtbi.2017.08.025 2017
[38]

D. G. Rand, S. Arbesman, and N. A. Christakis. Dynamic social networks promote cooperation in experiments with humans.Proceedings of the National Academy of Sciences, 108(48): 19193–19198, Nov. 2011. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1108243108. URL https://pnas.org/doi/full/10.1073/pnas.1108243108

work page doi:10.1073/pnas.1108243108 2011
[39]

H. Risken. Fokker-Planck Equation. In H. Risken, editor,The Fokker-Planck Equation: Methods of Solution and Applications, pages 63–95. Springer, Berlin, Heidelberg, 1996. ISBN 978-3-642-61544-3. doi:10.1007/978-3-642-61544-3_4. URL https://doi.org/10.1007/ 978-3-642-61544-3_4

work page doi:10.1007/978-3-642-61544-3_4 1996
[40]

Rudin.Principles of Mathematical Analysis

W. Rudin.Principles of Mathematical Analysis. McGraw-Hill, 3 edition, 1976. ISBN 978- 0070856134

work page 1976
[41]

Russell, C.-w

B. Russell, C.-w. Leung, and P. Turrini. Defection at first sight: learning partner selection in optional social dilemmas without prior information. In25th International Conference on Autonomous Agents and Multiagent Systems. IFAAMAS; ACM Digital library, May 2026. In Press

work page 2026
[42]

Sabater-Mir and C

J. Sabater-Mir and C. Sierra. Reputation and social network analysis in multi-agent systems. In Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1, pages 475–482, July 2002. doi:10.1145/544741.544854

work page doi:10.1145/544741.544854 2002
[43]

F. C. Santos, J. M. Pacheco, and T. Lenaerts. Cooperation Prevails When Individuals Adjust Their Social Ties.PLoS Computational Biology, 2(10):e140, Oct. 2006. ISSN 1553-7358. doi:10.1371/journal.pcbi.0020140. URL https://dx.plos.org/10.1371/journal.pcbi. 0020140

work page doi:10.1371/journal.pcbi.0020140 2006
[44]

Sato and J

Y . Sato and J. P. Crutchfield. Coupled Replicator Equations for the Dynamics of Learning in Multiagent Systems.Physical Review E, 67(1):015206, Jan. 2003. ISSN 1063-651X, 1095-

work page 2003
[45]

Coupled Replicator Equations for the Dynamics of Learning in Multiagent Systems

doi:10.1103/PhysRevE.67.015206. URL http://arxiv.org/abs/nlin/0204057. arXiv:nlin/0204057

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1103/physreve.67.015206
[46]

Shapiro.A Fixed-Point Farrago

J. Shapiro.A Fixed-Point Farrago. Springer, 01 2016. ISBN 978-3-319-27976-3. doi:10.1007/978-3-319-27978-7

work page doi:10.1007/978-3-319-27978-7 2016
[47]

J. M. Smith.Evolution and the Theory of Games. Cambridge University Press, Cam- bridge, 1982. ISBN 978-0-521-28884-2. doi:10.1017/CBO9780511806292. URL https://www.cambridge.org/core/books/evolution-and-the-theory-of-games/ A3BDF54AF5C6297E308AB15BBEF45E48. 13

work page doi:10.1017/cbo9780511806292 1982
[48]

Srinivasan, M

S. Srinivasan, M. Lanctot, V . Zambaldi, J. Pérolat, K. Tuyls, R. Munos, and M. Bowling. Actor-critic policy optimization in partially observable multiagent environments.Advances in neural information processing systems, 31, 2018

work page 2018
[49]

R. S. Sutton and A. G. Barto. Reinforcement learning - an introduction, 2nd edition. 2018. URLhttps://api.semanticscholar.org/CorpusID:277058247

work page 2018
[50]

Tuyls, K

K. Tuyls, K. Verbeeck, and T. Lenaerts. A selection-mutation model for q-learning in multi- agent systems. InProceedings of the second international joint conference on Autonomous agents and multiagent systems, AAMAS ’03, pages 693–700, New York, NY , USA, July 2003. Association for Computing Machinery. ISBN 978-1-58113-683-8. doi:10.1145/860575.860687. U...

work page doi:10.1145/860575.860687 2003
[51]

Villani et al.Optimal transport: old and new, volume 338

C. Villani et al.Optimal transport: old and new, volume 338. Springer, 2009

work page 2009
[52]

J. Wang, S. Suri, and D. J. Watts. Cooperation and assortativity with dynamic partner updating.Proceedings of the National Academy of Sciences, 109(36):14363–14368, Sept

work page
[53]

URL https://www.pnas.org/doi/10.1073/pnas

doi:10.1073/pnas.1120867109. URL https://www.pnas.org/doi/10.1073/pnas. 1120867109

work page doi:10.1073/pnas.1120867109
[54]

R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Rein- forcement Learning.Machine Learning, 8(3-4):229–256, May 1992. ISSN 0885-6125, 1573-

work page 1992
[55]

URL https://link.springer.com/10.1023/A: 1022672621406

doi:10.1023/A:1022672621406. URL https://link.springer.com/10.1023/A: 1022672621406

work page doi:10.1023/a:1022672621406
[56]

Will systems of llm agents cooperate: An investigation into a social dilemma,

R. Willis, Y . Du, J. Z. Leibo, and M. Luck. Will Systems of LLM Agents Cooperate: An Inves- tigation into a Social Dilemma, Jan. 2025. URLhttps://arxiv.org/abs/2501.16173v1

work page arXiv 2025
[57]

Zhang, S.-J

B.-Y . Zhang, S.-J. Fan, C. Li, X.-D. Zheng, J.-Z. Bao, R. Cressman, and Y . Tao. Opting out against defection leads to stable coexistence with cooperation.Scientific Reports, 6:35902, Oct

work page
[58]

doi:10.1038/srep35902

ISSN 2045-2322. doi:10.1038/srep35902. URL https://www.ncbi.nlm.nih.gov/ pmc/articles/PMC5075917/

work page doi:10.1038/srep35902 2045
[59]

Zheng, C

X.-D. Zheng, C. Li, J.-R. Yu, S.-C. Wang, S.-J. Fan, B.-Y . Zhang, and Y . Tao. A simple rule of direct reciprocity leads to the stable coexistence of cooperation and defection in the Prisoner’s Dilemma game.Journal of Theoretical Biology, 420:12–17, May 2017. ISSN 00225193. doi:10.1016/j.jtbi.2017.02.036. URL https://linkinghub.elsevier.com/ retrieve/pii...

work page doi:10.1016/j.jtbi.2017.02.036 2017
[60]

+c 2x(1−x),Var(r 1|a0 =D) =b 2µ1(1−µ 1) +c 2x(1−x). The covariance terms are Cov(r0, r1|a0 =C) =b 2Cov(Y0, Y1|a0 =C) =b 2(E[Y0Y1|a0 =C]−E[Y 0|a0 =C]E[Y 1|a0 =C]) =b 2(µ3 +µ 2 1 −µ 2µ1 −µ 1(µ2 +µ 1 −µ 2 1)) =b 2(µ3 −2µ 2µ1 +µ 3 1) Cov(r0, r1|a0 =D) =b 2Cov(Y0, Y1|a0 =D) =b 2(E[Y0Y1|a0 =D]−E[Y 0|a0 =D]E[Y 1|a0 =D]) =b 2(µ2 1 −µ 2

work page
[61]

= 0 It remains to condition on the second step,h= 1. Explicitly, the terms are given by E[r1|a1 =C] =bm 1(x) =b Z 1 0 y[xyρ(y) +ρ(y)(1−xµ 1)]dy =b(µ 1 +x(µ 2 −µ 2 1)) E[r1|a1 =D] =bm 1(x) +c =b(µ 1 +x(µ 2 −µ 2 1)) +c Var(r1|a1 =a) =b 2m1(x)(1−m 1(x)) =b 2(µ1 +x(µ 2 −µ 2 1))(1−(µ 1 +x(µ 2 −µ 2 1))) Collating these terms, we can summarise the elements of th...

work page
[62]

+ (µ1 −µ 2 1) + 2(µ3 −2µ 2µ1 +µ 3 1)] +c 2x(1−x) [b(µ2 + 2µ1 −µ 2

work page
[63]

+c(1−x)−β] 2 D2b 2µ1(1−µ 1) +c 2x(1−x) [2bµ 1 +c(2−x)−β] 2 1C b 2(µ1 +x(µ 2 −µ 2 1))(1−µ 1 −x(µ 2 −µ 2 1)) [b(µ 1 +x(µ 2 −µ 2 1))−β] 2 D b 2(µ1 +x(µ 2 −µ 2 1))(1−µ 1 −x(µ 2 −µ 2 1)) [b(µ 1 +x(µ 2 −µ 2 1)) +c−β] 2 18 ROFTThe expected cooperation rate of the opponent in the next step given the current opponent and the focal agent’s action is E[Y1|Y0, a0 =C]...

work page
[64]

= 0 Cov(r0, r1|a0 =D) =b 2Cov(Y0, Y1|a0 =D) =b 2(E[Y0Y1|a0 =D]−E[Y 0|a0 =D]E[Y 1|D]) =b 2(µ2 −µ 3 +µ 1µ2 −µ 1(µ1 +µ 2 1 −µ 2)) =b 2(µ2 −µ 2 1 + 2µ1µ2 −µ 3 1 −µ 3) It remains to condition on the second step,h= 1. Explicitly, the terms are given by E[r1|a1 =C] =bm 1(x) =b Z 1 0 y[(1−x)(1−y)ρ(y) +ρ(y)(1−(1−x)(1−µ 1))]dy =b(µ 1 −(1−x)(µ 2 −µ 2 1)) E[r1|a1 =D]...

work page
[65]

+c 2x(1−x) [2bµ 1 +c(1−x)−β] 2 D2b 2(µ1 +µ 2 −2µ 2

work page
[66]

+c 2x(1−x) [2bµ 1 +c(2−x)−β] 2 1C b 2µ1(1−µ 1) [bµ 1 −β] 2 D b 2µ1(1−µ 1) [bµ 1 +c−β] 2 Always SwitchSince the transition is independent of the action, for both actions a∈ {C, D} the expected cooperation rate of the opponent in the next step given the current opponent and action is E[Y1|Y0, a0 =a] =µ 1. Then computing the conditional mean, E[Y1|a] =E[E[Y ...

work page
[67]

Collating these terms, we can summarise the elements of the second moment in Table 9

= 0 Note that for conditioning at step h= 1 , the same derivation as Always Stay holds. Collating these terms, we can summarise the elements of the second moment in Table 9. Table 9: Second moments (Sh a ) of episodic reward whenH= 2under Always Switch. h aVar(R h|ah =a) (G h a −β) 2 0C2b 2µ1(1−µ 1) +c 2x(1−x) [2bµ 1 +c(1−x)−β] 2 D2b 2µ1(1−µ 1) +c 2x(1−x)...

work page

[1] [1]

Anastassacos, S

N. Anastassacos, S. Hailes, and M. Musolesi. Partner Selection for the Emergence of Cooperation in Multi-Agent Systems Using Reinforcement Learning, Feb. 2020. URL https://aaai.org/Library/conferences-library.php. Conference Name: Thirty- Fourth AAAI Conference on Artificial Intelligence (AAAI-20) Meeting Name: Thirty-Fourth AAAI Conference on Artificial ...

work page 2020

[2] [2]

J. Bara, P. Turrini, and G. Andrighetto. Enabling imitation-based cooperation in dy- namic social networks.Autonomous Agents and Multi-Agent Systems, 36(2):34, May 2022. ISSN 1573-7454. doi:10.1007/s10458-022-09562-w. URL https://doi.org/10.1007/ s10458-022-09562-w

work page doi:10.1007/s10458-022-09562-w 2022

[3] [3]

Bernasconi, F

M. Bernasconi, F. Cacciamani, S. Fioravanti, N. Gatti, and F. Trovò. The evolutionary dynamics of soft-max policy gradient in multi-agent settings.Theoretical Computer Science, 1027: 115011, Feb. 2025. ISSN 0304-3975. doi:10.1016/j.tcs.2024.115011. URL https://www. sciencedirect.com/science/article/pii/S0304397524006285

work page doi:10.1016/j.tcs.2024.115011 2025

[4] [4]

Billingsley.Convergence of probability measures

P. Billingsley.Convergence of probability measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., second edition, 1999. ISBN 0-471-19745-9. A Wiley-Interscience Publication

work page 1999

[5] [5]

Journal of Artificial Intelligence Research53, 659–697 (2015) https://doi.org/10.1613/jair.4818 15

D. Bloembergen, K. Tuyls, D. Hennes, and M. Kaisers. Evolutionary Dynamics of Multi-Agent Learning: A Survey.Journal of Artificial Intelligence Research, 53:659–697, Aug. 2015. doi:10.1613/jair.4818

work page doi:10.1613/jair.4818 2015

[6] [6]

Bonnet and F

B. Bonnet and F. Rossi. The pontryagin maximum principle in the wasserstein space.Calculus of Variations and Partial Differential Equations, 58(1):11, Dec. 2018. ISSN 0944-2669, 1432-0835. doi:10.1007/s00526-018-1447-2. URLhttp://arxiv.org/abs/1711.07667

work page doi:10.1007/s00526-018-1447-2 2018

[7] [7]

C. Chu, Y . Li, J. Liu, S. Hu, X. Li, and Z. Wang. A Formal Model for Multiagent Q- Learning Dynamics on Regular Graphs. InProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 194–200, Vienna, Austria, July 2022. Inter- national Joint Conferences on Artificial Intelligence Organization. ISBN 978-1-956792-00-3. d...

work page doi:10.24963/ijcai.2022/28 2022

[8] [8]

Fan, C.-w

X. Fan, C.-w. Leung, and P. Turrini. Co-learning of strategy and structure achieves full cooperation in complex networks with dynamical linking. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 72–80. International Joint Conferences on Artificial Intelligence Organization, Sep. 2025. doi:10.24963/ijcai.2025/9

work page doi:10.24963/ijcai.2025/9 2025

[9] [9]

K. Fehl, D. J. van der Post, and D. Semmann. Co-evolution of behaviour and social network structure promotes human cooperation.Ecology Letters, 14(6):546–551, June 2011. ISSN 1461-0248. doi:10.1111/j.1461-0248.2011.01615.x

work page doi:10.1111/j.1461-0248.2011.01615.x 2011

[10] [10]

F. Fu, T. Wu, and L. Wang. Partner switching stabilizes cooperation in coevolutionary pris- oner’s dilemma.Physical Review E, 79(3):036101, Mar. 2009. ISSN 1539-3755, 1550-

work page 2009

[11] [11]

URL https://link.aps.org/doi/10.1103/ PhysRevE.79.036101

doi:10.1103/PhysRevE.79.036101. URL https://link.aps.org/doi/10.1103/ PhysRevE.79.036101

work page doi:10.1103/physreve.79.036101

[12] [12]

Fudenberg and C

D. Fudenberg and C. Harris. Evolutionary dynamics with aggregate shocks.Journal of Economic Theory, 57(2):420–441, Aug. 1992. ISSN 00220531. doi:10.1016/0022-0531(92)90044-I. URL https://linkinghub.elsevier.com/retrieve/pii/002205319290044I

work page doi:10.1016/0022-0531(92)90044-i 1992

[13] [13]

Galstyan

A. Galstyan. Continuous strategy replicator dynamics for multi-agent Q-learning.Autonomous Agents and Multi-Agent Systems, 26(1):37–53, Jan. 2013. ISSN 1573-7454. doi:10.1007/s10458- 011-9181-6. URLhttps://doi.org/10.1007/s10458-011-9181-6

work page doi:10.1007/s10458- 2013

[14] [14]

S. Hu, C. wing Leung, and H. fung Leung. Modelling the dynamics of multiagent q-learning in repeated symmetric games: a mean field theoretic approach. InNeural Information Processing Systems, 2019. URLhttps://api.semanticscholar.org/CorpusID:202768371. 11

work page 2019

[15] [15]

Hu, C.-w

S. Hu, C.-w. Leung, H.-f. Leung, and H. Soh. The dynamics of q-learning in population games: A physics-inspired continuity equation model.arXiv preprint arXiv:2203.01500, 2022

work page arXiv 2022

[16] [16]

K. Itô. Stochastic integral.Proceedings of the Imperial Academy, 20(8): 519–524, Jan. 1944. ISSN 0369-9846. doi:10.3792/pia/1195572786. URL https://projecteuclid.org/journals/proceedings-of-the-imperial-academy/ volume-20/issue-8/Stochastic-integral/10.3792/pia/1195572786.full

work page doi:10.3792/pia/1195572786 1944

[17] [17]

L. R. Izquierdo, S. S. Izquierdo, and F. Vega-Redondo. Leave and let leave: A sufficient condition to explain the evolutionary emergence of cooperation.Journal of Economic Dynamics and Control, 46:91–113, Sept. 2014. ISSN 0165-1889. doi:10.1016/j.jedc.2014.06.007. URL https://www.sciencedirect.com/science/article/pii/S0165188914001456

work page doi:10.1016/j.jedc.2014.06.007 2014

[18] [18]

S. S. Izquierdo, L. R. Izquierdo, and F. Vega-Redondo. The option to leave: Conditional dissociation in the evolution of cooperation.Journal of Theoretical Biology, 267(1):76–84, Nov

work page

[19] [19]

doi:10.1016/j.jtbi.2010.07.039

ISSN 0022-5193. doi:10.1016/j.jtbi.2010.07.039. URL https://www.sciencedirect. com/science/article/pii/S0022519310003966

work page doi:10.1016/j.jtbi.2010.07.039 2010

[20] [20]

Y . Kifer. Random Perturbations of Dynamical Systems, 1988. URLhttps://link.springer. com/book/10.1007/978-1-4615-8181-9

work page doi:10.1007/978-1-4615-8181-9 1988

[21] [21]

Leung and P

C.-w. Leung and P. Turrini. Learning partner selection rules that sustain cooperation in social dilemmas with the option of opting out. InProceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), 2024

work page 2024

[22] [22]

Leung, S

C.-w. Leung, S. Hu, and H.-f. Leung. Modelling the Dynamics of Multi-Agent Q-learning: The Stochastic Effects of Local Interaction and Incomplete Information. InProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 384–390, Vienna, Austria, July 2022. International Joint Conferences on Artificial Intelligence Org...

work page doi:10.24963/ijcai.2022/55 2022

[23] [23]

Leung, S

C.-w. Leung, S. Hu, and H.-f. Leung. The stochastic evolutionary dynamics of softmax policy gradient in games. InProceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), 2024

work page 2024

[24] [24]

Leung, P

C.-w. Leung, P. Turrini, and A. Nowé. Curiosity-driven partner selection accelerates convention emergence in language games. InProc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), 2025

work page 2025

[25] [25]

Leung, P

C.-w. Leung, P. Turrini, F. P. Santos, and M. Musolesi. Learning to cooperate with minimal observability. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29521–29529, 2026

work page 2026

[26] [26]

Z. Li, Z. Yang, T. Wu, and L. Wang. Aspiration-Based Partner Switching Boosts Co- operation in Social Dilemmas.PLOS ONE, 9(6):e97866, June 2014. ISSN 1932-

work page 2014

[27] [27]

URL https://journals.plos.org/plosone/ article?id=10.1371/journal.pone.0097866

doi:10.1371/journal.pone.0097866. URL https://journals.plos.org/plosone/ article?id=10.1371/journal.pone.0097866

work page doi:10.1371/journal.pone.0097866

[28] [28]

The emergence of rational behavior in the presence of stochastic perturbations

P. Mertikopoulos and A. L. Moustakas. The emergence of rational behavior in the presence of stochastic perturbations.The Annals of Applied Probability, 20(4), Aug. 2010. ISSN 1050-5164. doi:10.1214/09-AAP651. URL http://arxiv.org/abs/0906.2094. arXiv:0906.2094 [math]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1214/09-aap651 2010

[29] [29]

J. Nash. Non-Cooperative Games.Annals of Mathematics, 54(2):286–295, 1951. ISSN 0003-486X. doi:10.2307/1969529. URLhttps://www.jstor.org/stable/1969529

work page doi:10.2307/1969529 1951

[30] [30]

Nguyen, H

D. Nguyen, H. Le, K. Do, S. Gupta, S. Venkatesh, and T. Tran. Navigating social dilemmas with llm-based agents via consideration of future consequences. In J. Kwok, editor,Proceed- ings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 223–231. International Joint Conferences on Artificial Intelligence Organiz...

work page doi:10.24963/ijcai.2025/26 2025

[31] [31]

M. A. Nowak. Five rules for the evolution of cooperation.Science (New York, N.y.), 314 (5805):1560–1563, Dec. 2006. ISSN 0036-8075. doi:10.1126/science.1133755. URL https: //pmc.ncbi.nlm.nih.gov/articles/PMC3279745/

work page doi:10.1126/science.1133755 2006

[32] [32]

J. M. Pacheco, A. Traulsen, and M. A. Nowak. Active linking in evolutionary games.Journal of Theoretical Biology, 243(3):437–443, Dec. 2006. ISSN 0022-

work page 2006

[33] [33]

URL https://www.sciencedirect.com/science/ article/pii/S0022519306002736

doi:10.1016/j.jtbi.2006.06.027. URL https://www.sciencedirect.com/science/ article/pii/S0022519306002736

work page doi:10.1016/j.jtbi.2006.06.027 2006

[34] [34]

J. M. Pacheco, A. Traulsen, and M. A. Nowak. Coevolution of Strategy and Structure in Complex Networks with Dynamical Linking.Physical Review Letters, 97(25):258103, Dec

work page

[35] [35]

doi:10.1103/PhysRevLett.97.258103

ISSN 0031-9007, 1079-7114. doi:10.1103/PhysRevLett.97.258103. URL https: //link.aps.org/doi/10.1103/PhysRevLett.97.258103

work page doi:10.1103/physrevlett.97.258103

[36] [36]

J. M. Pacheco, A. Traulsen, H. Ohtsuki, and M. A. Nowak. Repeated games and direct reciprocity under active linking.Journal of Theoretical Biology, 250(4):723–731, Feb. 2008. ISSN 00225193. doi:10.1016/j.jtbi.2007.10.040. URL https://linkinghub.elsevier. com/retrieve/pii/S0022519307005450

work page doi:10.1016/j.jtbi.2007.10.040 2008

[37] [37]

Priklopil, K

T. Priklopil, K. Chatterjee, and M. Nowak. Optional interactions and suspicious behaviour facilitates trustful cooperation in prisoners dilemma.Journal of Theoretical Biology, 433: 64–72, Nov. 2017. ISSN 0022-5193. doi:10.1016/j.jtbi.2017.08.025. URL https://www. sciencedirect.com/science/article/pii/S0022519317303995

work page doi:10.1016/j.jtbi.2017.08.025 2017

[38] [38]

D. G. Rand, S. Arbesman, and N. A. Christakis. Dynamic social networks promote cooperation in experiments with humans.Proceedings of the National Academy of Sciences, 108(48): 19193–19198, Nov. 2011. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1108243108. URL https://pnas.org/doi/full/10.1073/pnas.1108243108

work page doi:10.1073/pnas.1108243108 2011

[39] [39]

H. Risken. Fokker-Planck Equation. In H. Risken, editor,The Fokker-Planck Equation: Methods of Solution and Applications, pages 63–95. Springer, Berlin, Heidelberg, 1996. ISBN 978-3-642-61544-3. doi:10.1007/978-3-642-61544-3_4. URL https://doi.org/10.1007/ 978-3-642-61544-3_4

work page doi:10.1007/978-3-642-61544-3_4 1996

[40] [40]

Rudin.Principles of Mathematical Analysis

W. Rudin.Principles of Mathematical Analysis. McGraw-Hill, 3 edition, 1976. ISBN 978- 0070856134

work page 1976

[41] [41]

Russell, C.-w

B. Russell, C.-w. Leung, and P. Turrini. Defection at first sight: learning partner selection in optional social dilemmas without prior information. In25th International Conference on Autonomous Agents and Multiagent Systems. IFAAMAS; ACM Digital library, May 2026. In Press

work page 2026

[42] [42]

Sabater-Mir and C

J. Sabater-Mir and C. Sierra. Reputation and social network analysis in multi-agent systems. In Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1, pages 475–482, July 2002. doi:10.1145/544741.544854

work page doi:10.1145/544741.544854 2002

[43] [43]

F. C. Santos, J. M. Pacheco, and T. Lenaerts. Cooperation Prevails When Individuals Adjust Their Social Ties.PLoS Computational Biology, 2(10):e140, Oct. 2006. ISSN 1553-7358. doi:10.1371/journal.pcbi.0020140. URL https://dx.plos.org/10.1371/journal.pcbi. 0020140

work page doi:10.1371/journal.pcbi.0020140 2006

[44] [44]

Sato and J

Y . Sato and J. P. Crutchfield. Coupled Replicator Equations for the Dynamics of Learning in Multiagent Systems.Physical Review E, 67(1):015206, Jan. 2003. ISSN 1063-651X, 1095-

work page 2003

[45] [45]

Coupled Replicator Equations for the Dynamics of Learning in Multiagent Systems

doi:10.1103/PhysRevE.67.015206. URL http://arxiv.org/abs/nlin/0204057. arXiv:nlin/0204057

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1103/physreve.67.015206

[46] [46]

Shapiro.A Fixed-Point Farrago

J. Shapiro.A Fixed-Point Farrago. Springer, 01 2016. ISBN 978-3-319-27976-3. doi:10.1007/978-3-319-27978-7

work page doi:10.1007/978-3-319-27978-7 2016

[47] [47]

J. M. Smith.Evolution and the Theory of Games. Cambridge University Press, Cam- bridge, 1982. ISBN 978-0-521-28884-2. doi:10.1017/CBO9780511806292. URL https://www.cambridge.org/core/books/evolution-and-the-theory-of-games/ A3BDF54AF5C6297E308AB15BBEF45E48. 13

work page doi:10.1017/cbo9780511806292 1982

[48] [48]

Srinivasan, M

S. Srinivasan, M. Lanctot, V . Zambaldi, J. Pérolat, K. Tuyls, R. Munos, and M. Bowling. Actor-critic policy optimization in partially observable multiagent environments.Advances in neural information processing systems, 31, 2018

work page 2018

[49] [49]

R. S. Sutton and A. G. Barto. Reinforcement learning - an introduction, 2nd edition. 2018. URLhttps://api.semanticscholar.org/CorpusID:277058247

work page 2018

[50] [50]

Tuyls, K

K. Tuyls, K. Verbeeck, and T. Lenaerts. A selection-mutation model for q-learning in multi- agent systems. InProceedings of the second international joint conference on Autonomous agents and multiagent systems, AAMAS ’03, pages 693–700, New York, NY , USA, July 2003. Association for Computing Machinery. ISBN 978-1-58113-683-8. doi:10.1145/860575.860687. U...

work page doi:10.1145/860575.860687 2003

[51] [51]

Villani et al.Optimal transport: old and new, volume 338

C. Villani et al.Optimal transport: old and new, volume 338. Springer, 2009

work page 2009

[52] [52]

J. Wang, S. Suri, and D. J. Watts. Cooperation and assortativity with dynamic partner updating.Proceedings of the National Academy of Sciences, 109(36):14363–14368, Sept

work page

[53] [53]

URL https://www.pnas.org/doi/10.1073/pnas

doi:10.1073/pnas.1120867109. URL https://www.pnas.org/doi/10.1073/pnas. 1120867109

work page doi:10.1073/pnas.1120867109

[54] [54]

R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Rein- forcement Learning.Machine Learning, 8(3-4):229–256, May 1992. ISSN 0885-6125, 1573-

work page 1992

[55] [55]

URL https://link.springer.com/10.1023/A: 1022672621406

doi:10.1023/A:1022672621406. URL https://link.springer.com/10.1023/A: 1022672621406

work page doi:10.1023/a:1022672621406

[56] [56]

Will systems of llm agents cooperate: An investigation into a social dilemma,

R. Willis, Y . Du, J. Z. Leibo, and M. Luck. Will Systems of LLM Agents Cooperate: An Inves- tigation into a Social Dilemma, Jan. 2025. URLhttps://arxiv.org/abs/2501.16173v1

work page arXiv 2025

[57] [57]

Zhang, S.-J

B.-Y . Zhang, S.-J. Fan, C. Li, X.-D. Zheng, J.-Z. Bao, R. Cressman, and Y . Tao. Opting out against defection leads to stable coexistence with cooperation.Scientific Reports, 6:35902, Oct

work page

[58] [58]

doi:10.1038/srep35902

ISSN 2045-2322. doi:10.1038/srep35902. URL https://www.ncbi.nlm.nih.gov/ pmc/articles/PMC5075917/

work page doi:10.1038/srep35902 2045

[59] [59]

Zheng, C

X.-D. Zheng, C. Li, J.-R. Yu, S.-C. Wang, S.-J. Fan, B.-Y . Zhang, and Y . Tao. A simple rule of direct reciprocity leads to the stable coexistence of cooperation and defection in the Prisoner’s Dilemma game.Journal of Theoretical Biology, 420:12–17, May 2017. ISSN 00225193. doi:10.1016/j.jtbi.2017.02.036. URL https://linkinghub.elsevier.com/ retrieve/pii...

work page doi:10.1016/j.jtbi.2017.02.036 2017

[60] [60]

+c 2x(1−x),Var(r 1|a0 =D) =b 2µ1(1−µ 1) +c 2x(1−x). The covariance terms are Cov(r0, r1|a0 =C) =b 2Cov(Y0, Y1|a0 =C) =b 2(E[Y0Y1|a0 =C]−E[Y 0|a0 =C]E[Y 1|a0 =C]) =b 2(µ3 +µ 2 1 −µ 2µ1 −µ 1(µ2 +µ 1 −µ 2 1)) =b 2(µ3 −2µ 2µ1 +µ 3 1) Cov(r0, r1|a0 =D) =b 2Cov(Y0, Y1|a0 =D) =b 2(E[Y0Y1|a0 =D]−E[Y 0|a0 =D]E[Y 1|a0 =D]) =b 2(µ2 1 −µ 2

work page

[61] [61]

= 0 It remains to condition on the second step,h= 1. Explicitly, the terms are given by E[r1|a1 =C] =bm 1(x) =b Z 1 0 y[xyρ(y) +ρ(y)(1−xµ 1)]dy =b(µ 1 +x(µ 2 −µ 2 1)) E[r1|a1 =D] =bm 1(x) +c =b(µ 1 +x(µ 2 −µ 2 1)) +c Var(r1|a1 =a) =b 2m1(x)(1−m 1(x)) =b 2(µ1 +x(µ 2 −µ 2 1))(1−(µ 1 +x(µ 2 −µ 2 1))) Collating these terms, we can summarise the elements of th...

work page

[62] [62]

+ (µ1 −µ 2 1) + 2(µ3 −2µ 2µ1 +µ 3 1)] +c 2x(1−x) [b(µ2 + 2µ1 −µ 2

work page

[63] [63]

+c(1−x)−β] 2 D2b 2µ1(1−µ 1) +c 2x(1−x) [2bµ 1 +c(2−x)−β] 2 1C b 2(µ1 +x(µ 2 −µ 2 1))(1−µ 1 −x(µ 2 −µ 2 1)) [b(µ 1 +x(µ 2 −µ 2 1))−β] 2 D b 2(µ1 +x(µ 2 −µ 2 1))(1−µ 1 −x(µ 2 −µ 2 1)) [b(µ 1 +x(µ 2 −µ 2 1)) +c−β] 2 18 ROFTThe expected cooperation rate of the opponent in the next step given the current opponent and the focal agent’s action is E[Y1|Y0, a0 =C]...

work page

[64] [64]

= 0 Cov(r0, r1|a0 =D) =b 2Cov(Y0, Y1|a0 =D) =b 2(E[Y0Y1|a0 =D]−E[Y 0|a0 =D]E[Y 1|D]) =b 2(µ2 −µ 3 +µ 1µ2 −µ 1(µ1 +µ 2 1 −µ 2)) =b 2(µ2 −µ 2 1 + 2µ1µ2 −µ 3 1 −µ 3) It remains to condition on the second step,h= 1. Explicitly, the terms are given by E[r1|a1 =C] =bm 1(x) =b Z 1 0 y[(1−x)(1−y)ρ(y) +ρ(y)(1−(1−x)(1−µ 1))]dy =b(µ 1 −(1−x)(µ 2 −µ 2 1)) E[r1|a1 =D]...

work page

[65] [65]

+c 2x(1−x) [2bµ 1 +c(1−x)−β] 2 D2b 2(µ1 +µ 2 −2µ 2

work page

[66] [66]

+c 2x(1−x) [2bµ 1 +c(2−x)−β] 2 1C b 2µ1(1−µ 1) [bµ 1 −β] 2 D b 2µ1(1−µ 1) [bµ 1 +c−β] 2 Always SwitchSince the transition is independent of the action, for both actions a∈ {C, D} the expected cooperation rate of the opponent in the next step given the current opponent and action is E[Y1|Y0, a0 =a] =µ 1. Then computing the conditional mean, E[Y1|a] =E[E[Y ...

work page

[67] [67]

Collating these terms, we can summarise the elements of the second moment in Table 9

= 0 Note that for conditioning at step h= 1 , the same derivation as Always Stay holds. Collating these terms, we can summarise the elements of the second moment in Table 9. Table 9: Second moments (Sh a ) of episodic reward whenH= 2under Always Switch. h aVar(R h|ah =a) (G h a −β) 2 0C2b 2µ1(1−µ 1) +c 2x(1−x) [2bµ 1 +c(1−x)−β] 2 D2b 2µ1(1−µ 1) +c 2x(1−x)...

work page