arxiv: 2605.18185 · v1 · pith:NIAGMWQYnew · submitted 2026-05-18 · 💻 cs.MA

The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection

Pith reviewed 2026-05-20 00:18 UTC · model grok-4.3

classification 💻 cs.MA
keywords policy gradient · social dilemmas · partner selection · cooperation emergence · multi-agent learning · opponent distribution · Wiener process · stationary distribution

The pith

Partner selection changes opponent distributions to promote cooperation in policy-gradient social dilemmas when population variance is present.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops an analytical model for how partner selection influences the learning dynamics of self-interested agents playing social dilemmas. It demonstrates that selection alters the distribution of encountered opponents, which in turn reshapes the reward landscape in a way that favors cooperation according to established rules. The work identifies population variance as a necessary condition for cooperation to arise and uses a two-dimensional Wiener process to incorporate stochastic effects from random partner encounters. Simulations validate that the resulting model matches observed policy-gradient behavior and shows how learning rates influence whether cooperation stabilizes.

Core claim

Partner selection modifies the opponent distribution and thereby the reward landscape faced by policy-gradient learners, which promotes cooperation under simple rules from the literature. Population variance is a necessary condition for cooperation to emerge. A two-dimensional Wiener process captures the stochastic effects of partner selection, yielding a sufficient condition for the population to be cooperation-promoting and proving the existence of a stationary distribution.

What carries the argument

The shift in opponent distribution induced by partner selection, modeled through a two-dimensional Wiener process to represent stochastic encounters.
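
The distribution shift can be made concrete with one expression that appears in the paper's appendix derivation. The sketch below restates it under stated assumptions: ρ is read as the density of opponents' cooperation probabilities, µ_k as its k-th moment, and x as the focal agent's cooperation probability. The name ρ_1 for the post-selection opponent density is introduced here for illustration and is not the paper's notation.

```latex
% rho(y): density of opponents' cooperation probabilities y in [0,1] (assumed reading)
% mu_k = \int_0^1 y^k rho(y) dy; x: focal agent's cooperation probability (assumed reading)
% rho_1: label introduced here for the opponent density after one selection step
\begin{align}
\rho_1(y) &= x\,y\,\rho(y) + \rho(y)\,(1 - x\mu_1), \\
m_1(x) &= \int_0^1 y\,\rho_1(y)\,dy
        = x\mu_2 + \mu_1\,(1 - x\mu_1)
        = \mu_1 + x\,(\mu_2 - \mu_1^2).
\end{align}
```

Because µ_2 − µ_1² is the variance of the population, the mean cooperation rate of encountered opponents m_1(x) collapses to the population mean µ_1 when that variance is zero. This is consistent with the necessity claim, although it is only one step of the argument and not a proof of it.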

If this is right

  • Cooperation emerges reliably in populations that maintain variance under partner selection.
  • The stochastic model accurately reproduces the full policy-gradient dynamics observed in simulations.
  • The learning rate controls the speed and stability of the transition to cooperation.
  • A derived sufficient condition identifies which populations will be cooperation-promoting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distribution-shift mechanism might be tested in other multi-agent learning algorithms to check generality beyond policy gradients.
  • Engineering environments with controlled variance could be explored as a design lever for encouraging cooperation in applied settings.
  • The stationary distribution result suggests long-run statistical predictions for agent behavior that could be checked against empirical multi-agent data.

Load-bearing premise

Partner selection effects can be fully captured by shifts in opponent distribution, and a two-dimensional Wiener process adequately models the stochastic encounters so that prior simple rules apply directly.
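
To show what a model "driven by a two-dimensional Wiener process" looks like in practice, here is a minimal Euler-Maruyama sketch. Everything specific in it is an assumption made for illustration: the drift `toy_drift`, the noise scale `sigma`, the step size, and the choice of a two-dimensional state are hypothetical and are not the paper's policy-gradient dynamics.

```python
import numpy as np

def euler_maruyama_2d(drift, sigma, x0, dt, n_steps, rng):
    """Simulate dX = drift(X) dt + sigma dW with a 2-D Wiener process W."""
    path = np.empty((n_steps + 1, 2))
    path[0] = x0
    for k in range(n_steps):
        # Wiener increments: two independent N(0, dt) draws per step.
        dW = rng.normal(0.0, np.sqrt(dt), size=2)
        path[k + 1] = path[k] + drift(path[k]) * dt + sigma * dW
    return path

def toy_drift(x):
    # Hypothetical mean-reverting drift toward (0.5, 0.5).
    # This is NOT the drift derived in the paper.
    return -(x - np.array([0.5, 0.5]))

rng = np.random.default_rng(0)
path = euler_maruyama_2d(toy_drift, sigma=0.1, x0=np.array([0.2, 0.8]),
                         dt=0.01, n_steps=5000, rng=rng)

# After the transient, the path fluctuates around the drift's fixed point,
# which is what a stationary distribution looks like in a single trajectory.
late_mean = path[2500:].mean(axis=0)
print(late_mean)
```

The toy drift was chosen only because its long-run behavior is well understood; checking the premise for the actual model would require substituting the paper's drift and diffusion terms.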

What would settle it

A controlled simulation in which population variance is set to zero yet cooperation still emerges and persists under partner selection and policy-gradient updates would contradict the necessity claim.
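
A much smaller check than the full experiment is sketched below: it isolates the selection step and leaves out policy-gradient learning entirely. It assumes a simplified rule in which an agent keeps its partner only if that partner cooperated and otherwise draws a random replacement. The rule, the function name `second_round_partner_shift`, and the Beta(2, 2) population are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def second_round_partner_shift(coop_probs, n_trials, rng):
    """Mean cooperation probability of the round-2 partner minus the round-1 partner.

    Assumed rule (illustrative): keep the partner if it cooperated in round 1,
    otherwise switch to a uniformly random member of the population.
    """
    first = rng.choice(coop_probs, size=n_trials)        # random first partner
    partner_cooperated = rng.random(n_trials) < first    # partner's round-1 action
    fresh = rng.choice(coop_probs, size=n_trials)        # random replacement
    second = np.where(partner_cooperated, first, fresh)  # stay only if partner cooperated
    return second.mean() - first.mean()

rng = np.random.default_rng(0)
n_agents, n_trials = 10_000, 200_000

homogeneous = np.full(n_agents, 0.5)               # zero population variance
heterogeneous = rng.beta(2.0, 2.0, size=n_agents)  # same mean, positive variance

shift_homogeneous = second_round_partner_shift(homogeneous, n_trials, rng)
shift_heterogeneous = second_round_partner_shift(heterogeneous, n_trials, rng)

# Under the assumed rule, the expected shift equals the population variance.
print(shift_homogeneous, shift_heterogeneous, heterogeneous.var())
```

With zero variance the selection rule has nothing to select on, so the encountered-opponent mean cannot move. The test proposed above goes further by asking whether cooperation can still emerge through learning in that setting.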

Figures

Figures reproduced from arXiv: 2605.18185 by Benedict Russell, Chin-wing Leung, Paolo Turrini.

Figure 1: Evolution of the strategy distribution where the population is initialised with ...
Figure 2: Evolution of the strategy distribution under OFT where the population is initialised with ...
Figure 3: Evolution of the strategy distribution where the population is initialised with ...
Figure 4: Evolution of the strategy distribution where the population is initialised with ...
Original abstract

In social dilemmas self-interested learning agents face the choice between the societal benefit of cooperation and the immediate reward of defection. Significant evidence exists on the benefits of assortment mechanisms such as partner selection for the emergence of cooperation, but this is largely available through agent-based simulations. In this paper, we provide an analytical solution to the problem, studying the policy-gradient dynamics in a multi-agent environment with partner selection. We show how partner selection changes the opponent distribution and hence the reward landscape, and prove this promotes cooperation under simple rules known from the literature. In particular, we find that population variance is a necessary condition for cooperation to emerge. Using a two-dimensional Wiener process, we extend the dynamics to capture the stochastic effects of partner selection and the resulting opponent distribution. We derive a sufficient condition for the population to be cooperation-promoting and prove the existence of a stationary distribution. Simulations confirm that the stochastic model accurately captures the policy-gradient dynamics and clarifies how the learning rate affects the emergence of cooperation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes policy-gradient dynamics in multi-agent social dilemmas with partner selection. It claims that partner selection alters the opponent distribution and reward landscape to promote cooperation under known rules from the literature, with population variance as a necessary condition for cooperation to emerge. The deterministic dynamics are extended via a two-dimensional Wiener process to model stochastic opponent encounters, yielding a sufficient condition for a cooperation-promoting population, a proof of stationary distribution existence, and simulation validation that the stochastic model captures the dynamics and learning-rate effects on cooperation.

Significance. If the derivations and proofs hold, the work would provide a valuable analytical bridge between simulation-based evidence on assortment mechanisms and policy-gradient learning in social dilemmas. It would establish variance as a necessary condition and offer diffusion-based conditions for stationary cooperative outcomes, strengthening theoretical understanding in multi-agent RL.

major comments (2)
  1. [§4] §4 (stochastic extension via 2D Wiener process): The derivation of the sufficient condition for a cooperation-promoting population and the proof of stationary distribution existence both rest on this diffusion approximation for partner selection. The approximation assumes independent stochastic encounters that may fail to preserve discrete matching correlations or finite-population effects inherent in actual partner selection; if so, the necessity of population variance for cooperation does not transfer to the original multi-agent system.
  2. [Abstract and §3–4] The claim that population variance is a necessary condition (stated in the abstract and derived from opponent-distribution shifts): This is load-bearing for the central result, yet its validity depends on the Wiener process accurately reproducing the higher-order statistics of partner selection; without explicit verification against the discrete matching process (e.g., via comparison of moments or simulation of finite-N effects), the necessity result remains conditional on the approximation.
minor comments (2)
  1. [Abstract] The abstract refers to 'simple rules known from the literature' without naming them; these should be explicitly cited in the introduction or model section for clarity.
  2. [§4] Notation for the two-dimensional Wiener process and its drift/diffusion terms could be introduced earlier with a clear link to the deterministic policy-gradient equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the scope of our analytical results. We address each major comment below, clarifying the separation between our deterministic analysis and the stochastic extension while committing to revisions that strengthen the validation of the approximation.

  1. Referee: [§4] §4 (stochastic extension via 2D Wiener process): The derivation of the sufficient condition for a cooperation-promoting population and the proof of stationary distribution existence both rest on this diffusion approximation for partner selection. The approximation assumes independent stochastic encounters that may fail to preserve discrete matching correlations or finite-population effects inherent in actual partner selection; if so, the necessity of population variance for cooperation does not transfer to the original multi-agent system.

    Authors: We appreciate the referee highlighting the limitations of the diffusion approximation. We clarify that the necessity of population variance for cooperation emergence is derived analytically from the deterministic policy-gradient dynamics under partner selection (Section 3), based on the induced shifts in opponent distribution; this result is established independently of the stochastic model. The two-dimensional Wiener process in Section 4 is introduced afterward specifically to obtain a sufficient condition for cooperation-promoting populations and to prove existence of a stationary distribution. We agree that the approximation idealizes encounters as independent and may not capture all higher-order correlations or finite-population effects present in discrete partner selection. Accordingly, we will revise the manuscript to expand the discussion of the diffusion approximation's assumptions and its relation to the discrete process, and to include additional simulations examining finite-N effects and correlation preservation.

    revision: partial

  2. Referee: [Abstract and §3–4] The claim that population variance is a necessary condition (stated in the abstract and derived from opponent-distribution shifts): This is load-bearing for the central result, yet its validity depends on the Wiener process accurately reproducing the higher-order statistics of partner selection; without explicit verification against the discrete matching process (e.g., via comparison of moments or simulation of finite-N effects), the necessity result remains conditional on the approximation.

    Authors: We note that the necessity claim is obtained from the deterministic analysis of opponent-distribution shifts due to partner selection (Section 3 and appendix proofs) and does not rely on the Wiener process, which is used only for the subsequent stochastic extension and sufficient-condition derivation. The manuscript already reports simulations showing that the stochastic model captures the overall policy-gradient dynamics and learning-rate effects. To directly respond to the concern about higher-order statistics, we will add explicit moment comparisons between the discrete partner-selection process and the diffusion approximation, together with finite-population simulations, thereby providing the requested verification and removing any conditionality on the approximation for the necessity result.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained analytical modeling

The paper constructs an explicit stochastic model via a two-dimensional Wiener process to approximate partner selection effects on opponent distributions, then derives a sufficient condition for cooperation promotion and proves existence of a stationary distribution from the resulting Fokker-Planck or Kolmogorov forward equations. These steps are forward derivations from the stated diffusion approximation and the imported simple rules from the literature; they do not reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. The necessity of population variance follows from the variance term in the derived drift or diffusion coefficients rather than being presupposed. No quoted equation equates a claimed prediction directly to an input fit or prior self-result. The analysis therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the model appears to rest on standard policy-gradient assumptions and a Wiener-process approximation whose details are not visible.



