The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection
Pith reviewed 2026-05-20 00:18 UTC · model grok-4.3
The pith
Partner selection changes opponent distributions to promote cooperation in policy-gradient social dilemmas when population variance is present.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Partner selection modifies the opponent distribution and thereby the reward landscape faced by policy-gradient learners, which promotes cooperation under simple rules from the literature. Population variance is a necessary condition for cooperation to emerge. A two-dimensional Wiener process captures the stochastic effects of partner selection, yielding a sufficient condition for the population to be cooperation-promoting and proving the existence of a stationary distribution.
What carries the argument
The shift in opponent distribution induced by partner selection, modeled through a two-dimensional Wiener process to represent stochastic encounters.
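A worked illustration of why variance enters, offered as a sketch under a toy rule assumed here rather than as the paper's exact model: let ρ be the density of cooperation rates y in the population, with moments µ_k = ∫ y^k ρ(y) dy, and let the focal agent cooperate with probability x. Suppose the pair stays together after mutual cooperation (probability xy) and otherwise the focal agent draws a fresh partner from ρ. The mean cooperation rate of the next partner, written m_1(x), is then:

```latex
\begin{aligned}
m_1(x) &= \int_0^1 y\,\big[\,x\,y\,\rho(y) + (1 - x\mu_1)\,\rho(y)\,\big]\,dy \\
       &= x\,\mu_2 + (1 - x\mu_1)\,\mu_1 \\
       &= \mu_1 + x\,(\mu_2 - \mu_1^2).
\end{aligned}
```

The correction term is x times the population variance µ_2 − µ_1². Under this toy rule, a population with zero variance leaves the opponent distribution unchanged, so selection has nothing to act on.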
If this is right
- Cooperation emerges reliably in populations that maintain variance under partner selection.
- The stochastic model accurately reproduces the full policy-gradient dynamics observed in simulations.
- The learning rate controls the speed and stability of the transition to cooperation.
- A derived sufficient condition identifies which populations will be cooperation-promoting.
Where Pith is reading between the lines
- The same distribution-shift mechanism might be tested in other multi-agent learning algorithms to check generality beyond policy gradients.
- Engineering environments with controlled variance could be explored as a design lever for encouraging cooperation in applied settings.
- The stationary distribution result suggests long-run statistical predictions for agent behavior that could be checked against empirical multi-agent data.
Load-bearing premise
Partner selection effects can be fully captured by shifts in opponent distribution, and a two-dimensional Wiener process adequately models the stochastic encounters so that prior simple rules apply directly.
What would settle it
A controlled simulation in which population variance is set to zero yet cooperation still emerges and persists under partner selection and policy-gradient updates would contradict the necessity claim.
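A minimal sketch of the opponent-distribution part of such a check, without the policy-gradient updates. It assumes a toy selection rule that is not taken from the paper: the focal agent cooperates with probability `x`, the pair stays together after mutual cooperation, and otherwise a fresh partner is drawn from the population. The function names and parameter values are hypothetical.

```python
import numpy as np

def next_partner_mean(coop_rates, x, rng, n_samples=200_000):
    """Monte Carlo estimate of the next partner's mean cooperation rate.

    Toy rule (an assumption): stay after mutual cooperation, otherwise
    redraw a partner uniformly from the population.
    """
    coop_rates = np.asarray(coop_rates, dtype=float)
    current = rng.choice(coop_rates, size=n_samples)  # current partners' rates
    fresh = rng.choice(coop_rates, size=n_samples)    # replacement partners
    focal_cooperates = rng.random(n_samples) < x
    partner_cooperates = rng.random(n_samples) < current
    stay = focal_cooperates & partner_cooperates
    return np.where(stay, current, fresh).mean()

def predicted_mean(coop_rates, x):
    """Closed form for the toy rule: population mean plus x times variance."""
    coop_rates = np.asarray(coop_rates, dtype=float)
    return coop_rates.mean() + x * coop_rates.var()

rng = np.random.default_rng(0)
x = 0.6
varied = rng.uniform(0.0, 1.0, size=1_000)  # population with variance
flat = np.full(1_000, 0.5)                  # zero-variance population

est_varied = next_partner_mean(varied, x, rng)
est_flat = next_partner_mean(flat, x, rng)
print("varied:", est_varied, "predicted:", predicted_mean(varied, x))
print("flat:  ", est_flat, "predicted:", predicted_mean(flat, x))
```

In the zero-variance population every partner has the same cooperation rate, so the estimate cannot move away from the population mean. A full test of the necessity claim would additionally need the learners' policy-gradient updates, which this sketch omits.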
Original abstract
In social dilemmas self-interested learning agents face the choice between the societal benefit of cooperation and the immediate reward of defection. Significant evidence exists on the benefits of assortment mechanisms such as partner selection for the emergence of cooperation, but this is largely available through agent-based simulations. In this paper, we provide an analytical solution to the problem, studying the policy-gradient dynamics in a multi-agent environment with partner selection. We show how partner selection changes the opponent distribution and hence the reward landscape, and prove this promotes cooperation under simple rules known from the literature. In particular, we find that population variance is a necessary condition for cooperation to emerge. Using a two-dimensional Wiener process, we extend the dynamics to capture the stochastic effects of partner selection and the resulting opponent distribution. We derive a sufficient condition for the population to be cooperation-promoting and prove the existence of a stationary distribution. Simulations confirm that the stochastic model accurately captures the policy-gradient dynamics and clarifies how the learning rate affects the emergence of cooperation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes policy-gradient dynamics in multi-agent social dilemmas with partner selection. It claims that partner selection alters the opponent distribution and the reward landscape, which promotes cooperation under known rules from the literature, and that population variance is a necessary condition for cooperation to emerge. The deterministic dynamics are extended via a two-dimensional Wiener process to model stochastic opponent encounters. This yields a sufficient condition for a cooperation-promoting population and a proof that a stationary distribution exists. Simulations are used to check that the stochastic model captures the dynamics and the effect of the learning rate on cooperation.
Significance. If the derivations and proofs hold, the work would provide a valuable analytical bridge between simulation-based evidence on assortment mechanisms and policy-gradient learning in social dilemmas. It would establish variance as a necessary condition and offer diffusion-based conditions for stationary cooperative outcomes, strengthening theoretical understanding in multi-agent RL.
major comments (2)
- [§4] §4 (stochastic extension via 2D Wiener process): The derivation of the sufficient condition for a cooperation-promoting population and the proof of stationary distribution existence both rest on this diffusion approximation for partner selection. The approximation assumes independent stochastic encounters that may fail to preserve discrete matching correlations or finite-population effects inherent in actual partner selection; if so, the necessity of population variance for cooperation does not transfer to the original multi-agent system.
- [Abstract and §3–4] The claim that population variance is a necessary condition (stated in the abstract and derived from opponent-distribution shifts): This is load-bearing for the central result, yet its validity depends on the Wiener process accurately reproducing the higher-order statistics of partner selection; without explicit verification against the discrete matching process (e.g., via comparison of moments or simulation of finite-N effects), the necessity result remains conditional on the approximation.
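One way the requested finite-N comparison could be set up is sketched below. It assumes a toy rule that is not taken from the paper: the pair stays together after mutual cooperation, and on a switch the focal agent redraws uniformly among the other N − 1 partners. The function names are hypothetical. The sketch compares the exact finite-N mean of the next partner's cooperation rate with the infinite-population expression µ_1 + x(µ_2 − µ_1²).

```python
import numpy as np

def finite_n_next_mean(coop_rates, x):
    """Exact next-partner mean when a switch redraws among the other N-1 partners."""
    y = np.asarray(coop_rates, dtype=float)
    n = y.size
    stay_prob = x * y                      # mutual cooperation with partner i
    others_mean = (y.sum() - y) / (n - 1)  # mean rate of everyone except partner i
    return np.mean(stay_prob * y + (1.0 - stay_prob) * others_mean)

def mean_field_next_mean(coop_rates, x):
    """Infinite-population counterpart: redraws ignore who the current partner is."""
    y = np.asarray(coop_rates, dtype=float)
    return y.mean() + x * y.var()

def finite_n_gap(n, x=0.6, seed=0):
    """Absolute gap between the finite-N and mean-field first moments."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(0.0, 1.0, size=n)
    return abs(finite_n_next_mean(y, x) - mean_field_next_mean(y, x))

for n in (10, 100, 1000):
    print(n, finite_n_gap(n))
```

For this toy rule the gap works out to x times the population variance divided by N − 1, so it shrinks roughly like 1/N. Higher moments and matching correlations between agents would need a separate comparison.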
minor comments (2)
- [Abstract] The abstract refers to 'simple rules known from the literature' without naming them; these should be explicitly cited in the introduction or model section for clarity.
- [§4] Notation for the two-dimensional Wiener process and its drift/diffusion terms could be introduced earlier with a clear link to the deterministic policy-gradient equations.
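To make the drift and diffusion notation concrete, the following is a minimal Euler–Maruyama sketch for a generic two-dimensional Itô process driven by a two-dimensional Wiener process. The drift and diffusion used here are placeholders (a mean-reverting process with constant noise), not the paper's policy-gradient dynamics, and all names and parameter values are assumptions.

```python
import numpy as np

def euler_maruyama_2d(drift, diffusion, x0, dt, n_steps, n_paths, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW for a 2D state X.

    drift(X) returns shape (n_paths, 2); diffusion(X) returns shape
    (n_paths, 2, 2), the matrix that multiplies the Wiener increment.
    """
    x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    for _ in range(n_steps):
        # Independent Wiener increments with variance dt per coordinate.
        dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, 2))
        x = x + drift(x) * dt + np.einsum("pij,pj->pi", diffusion(x), dw)
    return x

# Placeholder dynamics (NOT the paper's): mean-reverting drift, constant noise.
theta, sigma = 1.0, 0.5

def drift(x):
    return -theta * x

def diffusion(x):
    return np.broadcast_to(sigma * np.eye(2), (x.shape[0], 2, 2))

rng = np.random.default_rng(0)
samples = euler_maruyama_2d(drift, diffusion, x0=[1.0, -1.0], dt=0.01,
                            n_steps=1000, n_paths=2000, rng=rng)
print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

For this placeholder process each coordinate is an Ornstein–Uhlenbeck process, whose stationary variance is sigma² / (2 theta), so the printed variances should settle near that value. Swapping in the paper's drift and diffusion terms would require their explicit form, which this page does not give.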
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which help clarify the scope of our analytical results. We address each major comment below, clarifying the separation between our deterministic analysis and the stochastic extension while committing to revisions that strengthen the validation of the approximation.
- Referee: [§4] §4 (stochastic extension via 2D Wiener process): The derivation of the sufficient condition for a cooperation-promoting population and the proof of stationary distribution existence both rest on this diffusion approximation for partner selection. The approximation assumes independent stochastic encounters that may fail to preserve discrete matching correlations or finite-population effects inherent in actual partner selection; if so, the necessity of population variance for cooperation does not transfer to the original multi-agent system.
  Authors: We appreciate the referee highlighting the limitations of the diffusion approximation. We clarify that the necessity of population variance for cooperation emergence is derived analytically from the deterministic policy-gradient dynamics under partner selection (Section 3), based on the induced shifts in opponent distribution; this result is established independently of the stochastic model. The two-dimensional Wiener process in Section 4 is introduced afterward specifically to obtain a sufficient condition for cooperation-promoting populations and to prove existence of a stationary distribution. We agree that the approximation idealizes encounters as independent and may not capture all higher-order correlations or finite-population effects present in discrete partner selection. Accordingly, we will revise the manuscript to expand the discussion of the diffusion approximation's assumptions and its relation to the discrete process, and to include additional simulations examining finite-N effects and correlation preservation.
  Revision: partial
- Referee: [Abstract and §3–4] The claim that population variance is a necessary condition (stated in the abstract and derived from opponent-distribution shifts): This is load-bearing for the central result, yet its validity depends on the Wiener process accurately reproducing the higher-order statistics of partner selection; without explicit verification against the discrete matching process (e.g., via comparison of moments or simulation of finite-N effects), the necessity result remains conditional on the approximation.
  Authors: We note that the necessity claim is obtained from the deterministic analysis of opponent-distribution shifts due to partner selection (Section 3 and appendix proofs) and does not rely on the Wiener process, which is used only for the subsequent stochastic extension and sufficient-condition derivation. The manuscript already reports simulations showing that the stochastic model captures the overall policy-gradient dynamics and learning-rate effects. To directly respond to the concern about higher-order statistics, we will add explicit moment comparisons between the discrete partner-selection process and the diffusion approximation, together with finite-population simulations, thereby providing the requested verification and removing any conditionality on the approximation for the necessity result.
  Revision: yes
Circularity Check
No significant circularity; derivation is self-contained analytical modeling
The paper constructs an explicit stochastic model via a two-dimensional Wiener process to approximate partner selection effects on opponent distributions, then derives a sufficient condition for cooperation promotion and proves existence of a stationary distribution from the resulting Fokker-Planck or Kolmogorov forward equations. These steps are forward derivations from the stated diffusion approximation and the imported simple rules from the literature; they do not reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. The necessity of population variance follows from the variance term in the derived drift or diffusion coefficients rather than being presupposed. No quoted equation equates a claimed prediction directly to an input fit or prior self-result. The analysis therefore remains independent of its own outputs.
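For readers unfamiliar with the forward equation mentioned above, its generic form for a two-dimensional Itô diffusion is sketched here. The symbols f, G, D, p and p_∞ are notation assumed for this illustration; the paper's specific drift and diffusion terms are not reproduced on this page.

```latex
\begin{aligned}
dX_t &= f(X_t)\,dt + G(X_t)\,dW_t, \qquad X_t \in \mathbb{R}^2, \\
\partial_t p(x,t) &= -\sum_{i=1}^{2} \partial_{x_i}\big[f_i(x)\,p(x,t)\big]
  + \tfrac{1}{2}\sum_{i,j=1}^{2} \partial_{x_i}\partial_{x_j}\big[D_{ij}(x)\,p(x,t)\big],
  \qquad D = G\,G^{\top}, \\
0 &= -\sum_{i=1}^{2} \partial_{x_i}\big[f_i(x)\,p_\infty(x)\big]
  + \tfrac{1}{2}\sum_{i,j=1}^{2} \partial_{x_i}\partial_{x_j}\big[D_{ij}(x)\,p_\infty(x)\big].
\end{aligned}
```

The first line is the stochastic differential equation driven by a two-dimensional Wiener process W_t, the second is the Fokker-Planck (Kolmogorov forward) equation for the density p, and the third is the condition a stationary density p_∞ must satisfy. An existence proof amounts to showing that the third equation admits a normalizable, non-negative solution on the state space.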
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "population variance is a necessary condition for cooperation to emerge"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.