Reinforcement learning with reputation-based adaptive exploration promotes the evolution of cooperation

An Li; Chaoqian Wang; Hongwei Zheng; Longzhao Liu; Shaoting Tang; Wenqiang Zhu; Xin Wang; Yishen Jiang

arxiv: 2604.08103 · v1 · submitted 2026-04-09 · ⚛️ physics.comp-ph

Reinforcement learning with reputation-based adaptive exploration promotes the evolution of cooperation

An Li , Wenqiang Zhu , Chaoqian Wang , Longzhao Liu , Hongwei Zheng , Yishen Jiang , Xin Wang , Shaoting Tang This is my paper

Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3

classification ⚛️ physics.comp-ph

keywords reinforcement learningQ-learningreputationcooperationevolutionary gamesadaptive explorationmulti-agent systems

0 comments

The pith

Coupling exploration to local reputation in Q-learning promotes cooperation in evolutionary games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a Q-learning model in which agents adapt their exploration rates according to differences in local reputation and apply asymmetric reputation updates that depend on current status. Simulations show that the reputation-coupled exploration and the asymmetric updates each increase cooperation when introduced alone, and that combining them produces a stronger effect. The joint mechanism works by lowering exploration for high-reputation agents while raising it for low-reputation agents, and by magnifying the reputation payoff from cooperation at low status while magnifying the penalty from defection at high status. Readers would care because the model illustrates how social evaluation can steer individual learning toward collective cooperation without requiring fixed rules or external enforcement.

Core claim

Each mechanism independently promotes cooperation, and their combination yields a reinforcing effect. The joint mechanism enhances cooperation by making high reputation agents explore less and low reputation agents explore more, while adjusting reputation updates to amplify cooperative gains at low status and defection penalties at high status.

What carries the argument

Q-learning with exploration rates tied to local reputation differences together with asymmetric state-dependent reputation updates.

If this is right

Cooperation levels rise further when the two mechanisms operate together than when either is used alone.
High-reputation agents become more likely to exploit known cooperative strategies while low-reputation agents continue to sample alternatives.
Reputation payoffs become larger for cooperation when an agent has low status and larger for defection when an agent has high status.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coupling could be tested in settings where reputation is observed with noise or delay.
The approach suggests a route for designing multi-agent systems in which status signals naturally reduce wasteful exploration once good strategies are found.
Real-world reputation platforms might be examined to see whether they produce analogous exploration patterns among users.

Load-bearing premise

Agents can accurately perceive and respond to differences in local reputation when they adjust their exploration rates.

What would settle it

Simulations that keep exploration fixed and use symmetric reputation updates would show no comparable rise in cooperation levels.

Figures

Figures reproduced from arXiv: 2604.08103 by An Li, Chaoqian Wang, Hongwei Zheng, Longzhao Liu, Shaoting Tang, Wenqiang Zhu, Xin Wang, Yishen Jiang.

**Figure 1.** Figure 1: FIG. 1. Adaptive exploration and asymmetric reputation updating independently and directionally reshape the evolution [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 2.** Figure 2: FIG. 2. Synergistic effect between adaptive exploration and asymmetric reputation. (a) Heat map of the fraction of cooperation [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 4.** Figure 4: FIG. 4. Reputation concern governs the cooperation regime. (a) Bar chart of the fraction of cooperation [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: FIG. 5. Spatiotemporal evolution of strategy and reputation for different reputation concern. Snapshots of strategy (top row in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: FIG. 6. Impact of baseline exploration rate. We show the [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Multi-agent reinforcement learning serves as an effective tool for studying strategy adaptation in evolutionary games. Although prior work has integrated Q-learning with reputation mechanisms to promote cooperation, most existing algorithms adopt fixed exploration rates and overlook the influence of social context on exploratory behavior. In practice, individuals may adjust their willingness to explore based on their reputation and perceived social standing. To address this, we propose a Q-learning model that couples exploration rates with local reputation differences and incorporates asymmetric, state-dependent reputation updates. Our results show that each mechanism independently promotes cooperation, and their combination yields a reinforcing effect. The joint mechanism enhances cooperation by making ``high reputation--low exploration, low reputation--high exploration'', while adjusting reputation updates to amplify cooperative gains at low status and defection penalties at high status. This study thus offers insights into how social evaluation can shape learning behavior in complex environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds reputation-dependent adaptive exploration and asymmetric updates to Q-learning in evolutionary games, with simulations showing a reinforcing boost to cooperation, but that synergy looks tied to the specific asymmetry chosen.

read the letter

The main takeaway is that this work couples exploration rates to local reputation differences in a Q-learning model and layers on asymmetric, status-dependent reputation updates. Their simulations indicate each piece raises cooperation on its own and the combination does more through the high-rep/low-exploration and low-rep/high-exploration pattern plus the biased updates.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Q-learning model in multi-agent reinforcement learning for evolutionary games. It couples exploration rates to local reputation differences and incorporates asymmetric, state-dependent reputation updates. Simulations are claimed to show that each mechanism independently promotes cooperation and that their combination produces a reinforcing effect through the mapping of high reputation to low exploration (and vice versa), together with status-dependent amplification of cooperative gains at low status and defection penalties at high status.

Significance. If the simulation results are robust, the work would provide a concrete demonstration of how social-evaluation mechanisms can shape exploratory behavior in RL agents and thereby influence the evolution of cooperation. The combination of adaptive exploration and asymmetric reputation updates is a novel modeling choice that could inform both evolutionary game theory and multi-agent RL design.

major comments (2)

[Results / Model definition] The central claim of a reinforcing (synergistic) effect between the two mechanisms rests on the specific asymmetric reputation-update rule. The manuscript should demonstrate that the reported synergy survives under alternative functional forms (e.g., symmetric updates or reversed asymmetry); otherwise the headline result may be an artifact of an untested modeling choice rather than a general consequence of coupling reputation to exploration.
[Methods / Simulation setup] The abstract and methods description provide no information on simulation parameters (population size, payoff matrix, learning rates, number of independent runs), statistical tests, baseline comparisons, or error bars. Without these details the data support for the independent and reinforcing effects cannot be evaluated.

minor comments (1)

[Model] Clarify the precise evolutionary game (e.g., Prisoner's Dilemma parameters) and the exact functional form of the reputation-update rule in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's potential significance and for the constructive major comments. We address each point below and have revised the manuscript accordingly to strengthen the presentation and robustness of the results.

read point-by-point responses

Referee: [Results / Model definition] The central claim of a reinforcing (synergistic) effect between the two mechanisms rests on the specific asymmetric reputation-update rule. The manuscript should demonstrate that the reported synergy survives under alternative functional forms (e.g., symmetric updates or reversed asymmetry); otherwise the headline result may be an artifact of an untested modeling choice rather than a general consequence of coupling reputation to exploration.

Authors: We agree that the headline claim of a reinforcing effect would be more general if shown to be robust to the precise form of the reputation update. The asymmetry in our model is motivated by the social intuition that reputation gains from cooperation are more salient when an agent has low status, while defection penalties are amplified at high status. Nevertheless, to address the concern that the synergy might be an artifact of this choice, we have performed additional simulations with both symmetric updates and reversed asymmetry. These results will be added to the revised manuscript (new figure and accompanying text) to demonstrate that the reinforcing interaction between adaptive exploration and reputation persists, albeit with quantitative differences in the level of cooperation achieved. revision: yes
Referee: [Methods / Simulation setup] The abstract and methods description provide no information on simulation parameters (population size, payoff matrix, learning rates, number of independent runs), statistical tests, baseline comparisons, or error bars. Without these details the data support for the independent and reinforcing effects cannot be evaluated.

Authors: We thank the referee for noting this presentational gap. Although the simulation protocol is described in the main text, we acknowledge that the abstract and the opening of the Methods section did not list the parameters explicitly. In the revised manuscript we have added a dedicated parameter table (population size N=1000, Prisoner's Dilemma payoffs with benefit-to-cost ratio b/c=1.5, learning rate α=0.1, discount factor γ=0.9, 50 independent runs per condition) and have included error bars together with two-sided t-tests or Wilcoxon tests on all key comparisons. Baseline results for standard Q-learning with fixed ε-greedy exploration are already present but are now referenced more explicitly in the main figures. revision: yes

Circularity Check

0 steps flagged

No circularity; simulation outcomes are independent of any self-referential derivation.

full rationale

The paper introduces a Q-learning agent model that couples exploration rates to local reputation differences and applies asymmetric state-dependent reputation updates, then reports cooperation levels from multi-agent simulations. No equations or claims reduce by construction to fitted inputs, self-citations, or renamed empirical patterns; the reported reinforcing effect is an observed numerical outcome under the stated rules rather than a tautological restatement of the model definition itself. The derivation chain consists of standard RL updates plus explicitly chosen functional forms for reputation and exploration, none of which are justified solely by prior work from the same authors or by re-labeling known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities are identifiable. The model likely relies on standard Q-learning hyperparameters and game payoff matrices, but these are not detailed here.

pith-pipeline@v0.9.0 · 5461 in / 1174 out tokens · 47609 ms · 2026-05-10T17:56:32.945184+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ε_i(t) = ε₀ / (1 + tanh[η (R_i(t) − R̄_Ωi(t)) / (R_max − R_min)])
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Reputation update rule with δ (asymmetric, state-dependent on threshold A)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Fitness f_i = (1−θ)P_i + θ·scaled R_i; Q-learning on lattice PDG

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages

[1]

D. G. Rand and M. A. Nowak, Human cooperation, Trends in Cognitive Sciences17, 413 (2013)

work page 2013
[2]

Axelrod and W

R. Axelrod and W. D. Hamilton, The evolution of coop- eration, Science211, 1390 (1981)

work page 1981
[3]

Sigmund,The calculus of selfishness(Princeton Uni- versity Press, 2010)

K. Sigmund,The calculus of selfishness(Princeton Uni- versity Press, 2010)

work page 2010
[4]

P. A. Van Lange,Social dilemmas: Understanding human cooperation(OUP USA, 2014)

work page 2014
[5]

Pennisi, How did cooperative behavior evolve?, Sci- ence309, 93 (2005)

E. Pennisi, How did cooperative behavior evolve?, Sci- ence309, 93 (2005)

work page 2005
[6]

J. M. Smith and G. R. Price, The logic of animal conflict, Nature246, 15 (1973). 11

work page 1973
[7]

P. D. Taylor and L. B. Jonker, Evolutionary stable strate- gies and game dynamics, Mathematical Biosciences40, 145 (1978)

work page 1978
[8]

Ohtsuki, C

H. Ohtsuki, C. Hauert, E. Lieberman, and M. A. Nowak, A simple rule for the evolution of cooperation on graphs and social networks, Nature441, 502 (2006)

work page 2006
[9]

Perc and A

M. Perc and A. Szolnoki, Coevolutionary games—a mini review, BioSystems99, 109 (2010)

work page 2010
[10]

M. Perc, J. J. Jordan, D. G. Rand, Z. Wang, S. Boc- caletti, and A. Szolnoki, Statistical physics of human co- operation, Physics Reports687, 1 (2017)

work page 2017
[11]

C. Wang, M. Perc, and A. Szolnoki, Evolutionary dynam- ics of any multiplayer game on regular graphs, Nature Communications15, 5349 (2024)

work page 2024
[12]

Wang and A

C. Wang and A. Szolnoki, Evolution of cooperation un- der a generalized death-birth process, Physical Review E 107, 024303 (2023)

work page 2023
[13]

Wang and A

C. Wang and A. Szolnoki, Inertia in spatial public goods games under weak selection, Applied Mathematics and Computation449, 127941 (2023)

work page 2023
[14]

C. Wang, W. Zhu, and A. Szolnoki, The conflict between self-interaction and updating passivity in the evolution of cooperation, Chaos, Solitons & Fractals173, 113667 (2023)

work page 2023
[15]

C. Wang, W. Zhu, and A. Szolnoki, When greediness and self-confidence meet in a social dilemma, Physica A625, 129033 (2023)

work page 2023
[16]

Axelrod, Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution24, 3 (1980)

R. Axelrod, Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution24, 3 (1980)

work page 1980
[17]

Szab´ o and C

G. Szab´ o and C. T˝ oke, Evolutionary prisoner’s dilemma game on a square lattice, Physical Review E58, 69 (1998)

work page 1998
[18]

M. A. Nowak,Evolutionary dynamics: exploring the equations of life(Harvard University Press, 2006)

work page 2006
[19]

Sigmund, C

K. Sigmund, C. Hauert, and M. A. Nowak, Reward and punishment, Proceedings of the National Academy of Sci- ences98, 10757 (2001)

work page 2001
[20]

Szolnoki and M

A. Szolnoki and M. Perc, Reward and cooperation in the spatial public goods game, Europhysics Letters92, 38003 (2010)

work page 2010
[21]

Szolnoki, G

A. Szolnoki, G. Szab´ o, and M. Perc, Phase diagrams for the spatial public goods game with pool punishment, Physical Review E83, 036101 (2011)

work page 2011
[22]

W. Zhu, Q. Pan, S. Song, and M. He, Effects of exposure- based reward and punishment on the evolution of coop- eration in prisoner’s dilemma game, Chaos, Solitons & Fractals172, 113519 (2023)

work page 2023
[23]

T. A. Han, M. H. Duong, and M. Perc, Evolutionary mechanisms that promote cooperation may not promote social welfare, Journal of the Royal Society Interface21, 20240547 (2024)

work page 2024
[24]

L. Zhou, B. Wu, J. Du, and L. Wang, Aspiration dynam- ics generate robust predictions in heterogeneous popula- tions, Nature Communications12, 3250 (2021)

work page 2021
[25]

F. Chen, L. Zhou, and L. Wang, Cooperation among un- equal players with aspiration-driven learning, Journal of the Royal Society Interface21, 20230723 (2024)

work page 2024
[26]

J. S. Weitz, C. Eksin, K. Paarporn, S. P. Brown, and W. C. Ratcliff, An oscillating tragedy of the commons in replicator dynamics with game-environment feedback, Proceedings of the National Academy of Sciences113, E7518 (2016)

work page 2016
[27]

A. R. Tilman, J. B. Plotkin, and E. Ak¸ cay, Evolutionary games with environmental feedbacks, Nature communi- cations11, 915 (2020)

work page 2020
[28]

Wang and F

X. Wang and F. Fu, Eco-evolutionary dynamics with en- vironmental feedback: Cooperation in a changing world, Europhysics Letters132, 10001 (2020)

work page 2020
[29]

F. Fu, C. Hauert, M. A. Nowak, and L. Wang, Reputation-based partner choice promotes cooperation in social networks, Physical Review E78, 026117 (2008)

work page 2008
[30]

F. P. Santos, F. C. Santos, and J. M. Pacheco, Social norm complexity and past reputations in the evolution of cooperation, Nature555, 242 (2018)

work page 2018
[31]

C. Xia, J. Wang, M. Perc, and Z. Wang, Reputation and reciprocity, Physics of Life Reviews46, 8 (2023)

work page 2023
[32]

Wang and C

J. Wang and C. Xia, Reputation evaluation and its im- pact on the human cooperation—a recent survey, Euro- physics Letters141, 21001 (2023)

work page 2023
[33]

Ohtsuki and Y

H. Ohtsuki and Y. Iwasa, How should we define good- ness?—reputation dynamics in indirect reciprocity, Jour- nal of Theoretical Biology231, 107 (2004)

work page 2004
[34]

Ohtsuki and Y

H. Ohtsuki and Y. Iwasa, The leading eight: social norms that can maintain cooperation by indirect reciprocity, Journal of theoretical biology239, 435 (2006)

work page 2006
[35]

Hilbe, L

C. Hilbe, L. Schmid, J. Tkadlec, K. Chatterjee, and M. A. Nowak, Indirect reciprocity with private, noisy, and incomplete information, Proceedings of the National Academy of Sciences115, 12241 (2018)

work page 2018
[36]

M. Wei, X. Wang, L. Liu, H. Zheng, Y. Jiang, Y. Hao, Z. Zheng, F. Fu, and S. Tang, Indirect reciprocity in the public goods game with collective reputations, Journal of the Royal Society Interface22, 20240827 (2025)

work page 2025
[37]

M. A. Nowak and K. Sigmund, Evolution of indirect reci- procity by image scoring, Nature393, 573 (1998)

work page 1998
[38]

M. A. Nowak and K. Sigmund, Evolution of indirect reci- procity, Nature437, 1291 (2005)

work page 2005
[39]

W. Zhu, X. Wang, C. Wang, L. Liu, H. Zheng, and S. Tang, Reputation-based synergy and discount- ing mechanism promotes cooperation, New Journal of Physics26, 033046 (2024)

work page 2024
[40]

J. J. Skowronski and D. E. Carlston, Negativity and ex- tremity biases in impression formation: A review of ex- planations, Psychological Bulletin105, 131 (1989)

work page 1989
[41]

S. T. Fiske,Social beings: Core motives in social psychol- ogy(John Wiley & Sons, 2018)

work page 2018
[42]

R. F. Baumeister, E. Bratslavsky, C. Finkenauer, and K. D. Vohs, Bad is stronger than good, Review of general psychology5, 323 (2001)

work page 2001
[43]

I. S. Lim and N. Masuda, To trust or not to trust: Evolu- tionary dynamics of an asymmetric n-player trust game, IEEE Transactions on Evolutionary Computation28, 117 (2023)

work page 2023
[44]

A. R. Fragale, B. Rosen, C. Xu, and I. Merideth, The higher they are, the harder they fall: The effects of wrongdoer status on observer punishment recommenda- tions and intentionality attributions, Organizational Be- havior and Human Decision Processes108, 53 (2009)

work page 2009
[45]

Y. Dong, S. Sun, C. Xia, and M. Perc, Second-order rep- utation promotes cooperation in the spatial prisoner’s dilemma game, IEEE Access7, 82532 (2019)

work page 2019
[46]

Q. Chen, X. Peng, H. Kang, Y. Shen, and X. Sun, The impact of historical-behavior-based asymmetric reputa- tion and deposit mechanisms on the evolutionary spatial public goods game, Chaos: An Interdisciplinary Journal of Nonlinear Science35, 10.1063/5.0293944 (2025)

work page doi:10.1063/5.0293944 2025
[47]

Koster, M

R. Koster, M. Pˆ ıslar, A. Tacchetti, J. Balaguer, L. Liu, R. Elie, O. P. Hauser, K. Tuyls, M. Botvinick, and 12 C. Summerfield, Deep reinforcement learning can pro- mote sustainable human behaviour in a common-pool resource problem, Nature Communications16, 2824 (2025)

work page 2025
[48]

K. R. McKee, A. Tacchetti, M. A. Bakker, J. Balaguer, L. Campbell-Gillingham, R. Everett, and M. Botvinick, Scaffolding cooperation in human groups with deep re- inforcement learning, Nature Human Behaviour7, 1787 (2023)

work page 2023
[49]

L. Wang, D. Jia, L. Zhang, P. Zhu, M. Perc, L. Shi, and Z. Wang, L´ evy noise promotes cooperation in the pris- oner’s dilemma game with reinforcement learning, Non- linear Dynamics108, 1837 (2022)

work page 2022
[50]

L. Fan, Z. Song, L. Wang, Y. Liu, and Z. Wang, Incorpo- rating social payoff into reinforcement learning promotes cooperation, Chaos: An Interdisciplinary Journal of Non- linear Science32, 10.1063/5.0093996 (2022)

work page doi:10.1063/5.0093996 2022
[51]

Y. Geng, Y. Liu, Y. Lu, C. Shen, and L. Shi, Re- inforcement learning explains various conditional coop- eration, Applied Mathematics and Computation427, 127182 (2022)

work page 2022
[52]

Y. Xu, J. Wang, J. Chen, D. Zhao, M. ¨Ozer, C. Xia, and M. Perc, Reinforcement learning and collective coopera- tion on higher-order networks, Knowledge-Based Systems 301, 112326 (2024)

work page 2024
[53]

Mintz and F

B. Mintz and F. Fu, Evolutionary multi-agent rein- forcement learning in group social dilemmas, Chaos: An Interdisciplinary Journal of Nonlinear Science35, 10.1063/5.0246332 (2025)

work page doi:10.1063/5.0246332 2025
[54]

Xie and A

K. Xie and A. Szolnoki, Reinforcement learning in evo- lutionary game theory: A brief review of recent devel- opments, Applied Mathematics and Computation510, 129685 (2026)

work page 2026
[55]

Hou, Y.-S

Y. Hou, Y.-S. Ong, L. Feng, and J. M. Zurada, An evo- lutionary transfer reinforcement learning framework for multiagent systems, IEEE Transactions on Evolutionary Computation21, 601 (2017)

work page 2017
[56]

Zou and C

K. Zou and C. Huang, Incorporating reputation into re- inforcement learning can promote cooperation on hyper- graphs, Chaos, Solitons & Fractals186, 115203 (2024)

work page 2024
[57]

Ren and X.-J

T. Ren and X.-J. Zeng, Reputation-based interaction promotes cooperation with reinforcement learning, IEEE Transactions on Evolutionary Computation28, 1177 (2023)

work page 2023
[58]

Xie and A

K. Xie and A. Szolnoki, Reputation in public goods coop- eration under double q-learning protocol, Chaos, Solitons & Fractals196, 116398 (2025)

work page 2025
[59]

T. Ren, X. Yao, Y. Li, and X.-J. Zeng, Bottom-up reputation promotes cooperation with multi-agent re- inforcement learning, arXiv preprint arXiv:2502.01971 10.48550/arXiv.2502.01971 (2025)

work page doi:10.48550/arxiv.2502.01971 2025
[60]

Y. Zhu, B. Xing, and C. Xia, Q-learning update with second-order reputation promotes the evolution of trust within structured populations, Chaos, Solitons & Frac- tals199, 116653 (2025)

work page 2025
[61]

Zhang and X

Q. Zhang and X. Zhang, Q-learning driven cooperative evolution with dual-reputation incentive mechanisms, Applied Mathematics and Computation507, 129590 (2025)

work page 2025
[62]

C. J. Watkins and P. Dayan, Q-learning, Machine Learn- ing8, 279 (1992)

work page 1992
[63]

R. S. Sutton, A. G. Barto,et al.,Reinforcement learn- ing: an introduction, 2nd edn. Adaptive computation and machine learning, Vol. 1 (MIT press Cambridge, 2018)

work page 2018
[64]

Tokic and G

M. Tokic and G. Palm, Value-difference based explo- ration: adaptive control between epsilon-greedy and softmax, inAnnual conference on artificial intelligence (Springer, 2011) pp. 335–346

work page 2011
[65]

S. Shen, X. Zhang, A. Xu, and T. Duan, An adaptive exploration mechanism for q-learning in spatial public goods games, Chaos, Solitons & Fractals189, 115705 (2024)

work page 2024
[66]

Milinski, D

M. Milinski, D. Semmann, and H.-J. Krambeck, Repu- tation helps solve the ‘tragedy of the commons’, Nature 415, 424 (2002)

work page 2002
[67]

Fudenberg and D

D. Fudenberg and D. K. Levine, Maintaining a reputation when strategies are imperfectly observed, The Review of Economic Studies59, 561 (1992)

work page 1992
[68]

M. A. Nowak and R. M. May, Evolutionary games and spatial chaos, nature359, 826 (1992)

work page 1992
[69]

W. Zhu, Q. Pan, and M. He, Exposure-based reputa- tion mechanism promotes the evolution of cooperation, Chaos, Solitons & Fractals160, 112205 (2022)

work page 2022

[1] [1]

D. G. Rand and M. A. Nowak, Human cooperation, Trends in Cognitive Sciences17, 413 (2013)

work page 2013

[2] [2]

Axelrod and W

R. Axelrod and W. D. Hamilton, The evolution of coop- eration, Science211, 1390 (1981)

work page 1981

[3] [3]

Sigmund,The calculus of selfishness(Princeton Uni- versity Press, 2010)

K. Sigmund,The calculus of selfishness(Princeton Uni- versity Press, 2010)

work page 2010

[4] [4]

P. A. Van Lange,Social dilemmas: Understanding human cooperation(OUP USA, 2014)

work page 2014

[5] [5]

Pennisi, How did cooperative behavior evolve?, Sci- ence309, 93 (2005)

E. Pennisi, How did cooperative behavior evolve?, Sci- ence309, 93 (2005)

work page 2005

[6] [6]

J. M. Smith and G. R. Price, The logic of animal conflict, Nature246, 15 (1973). 11

work page 1973

[7] [7]

P. D. Taylor and L. B. Jonker, Evolutionary stable strate- gies and game dynamics, Mathematical Biosciences40, 145 (1978)

work page 1978

[8] [8]

Ohtsuki, C

H. Ohtsuki, C. Hauert, E. Lieberman, and M. A. Nowak, A simple rule for the evolution of cooperation on graphs and social networks, Nature441, 502 (2006)

work page 2006

[9] [9]

Perc and A

M. Perc and A. Szolnoki, Coevolutionary games—a mini review, BioSystems99, 109 (2010)

work page 2010

[10] [10]

M. Perc, J. J. Jordan, D. G. Rand, Z. Wang, S. Boc- caletti, and A. Szolnoki, Statistical physics of human co- operation, Physics Reports687, 1 (2017)

work page 2017

[11] [11]

C. Wang, M. Perc, and A. Szolnoki, Evolutionary dynam- ics of any multiplayer game on regular graphs, Nature Communications15, 5349 (2024)

work page 2024

[12] [12]

Wang and A

C. Wang and A. Szolnoki, Evolution of cooperation un- der a generalized death-birth process, Physical Review E 107, 024303 (2023)

work page 2023

[13] [13]

Wang and A

C. Wang and A. Szolnoki, Inertia in spatial public goods games under weak selection, Applied Mathematics and Computation449, 127941 (2023)

work page 2023

[14] [14]

C. Wang, W. Zhu, and A. Szolnoki, The conflict between self-interaction and updating passivity in the evolution of cooperation, Chaos, Solitons & Fractals173, 113667 (2023)

work page 2023

[15] [15]

C. Wang, W. Zhu, and A. Szolnoki, When greediness and self-confidence meet in a social dilemma, Physica A625, 129033 (2023)

work page 2023

[16] [16]

Axelrod, Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution24, 3 (1980)

R. Axelrod, Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution24, 3 (1980)

work page 1980

[17] [17]

Szab´ o and C

G. Szab´ o and C. T˝ oke, Evolutionary prisoner’s dilemma game on a square lattice, Physical Review E58, 69 (1998)

work page 1998

[18] [18]

M. A. Nowak,Evolutionary dynamics: exploring the equations of life(Harvard University Press, 2006)

work page 2006

[19] [19]

Sigmund, C

K. Sigmund, C. Hauert, and M. A. Nowak, Reward and punishment, Proceedings of the National Academy of Sci- ences98, 10757 (2001)

work page 2001

[20] [20]

Szolnoki and M

A. Szolnoki and M. Perc, Reward and cooperation in the spatial public goods game, Europhysics Letters92, 38003 (2010)

work page 2010

[21] [21]

Szolnoki, G

A. Szolnoki, G. Szab´ o, and M. Perc, Phase diagrams for the spatial public goods game with pool punishment, Physical Review E83, 036101 (2011)

work page 2011

[22] [22]

W. Zhu, Q. Pan, S. Song, and M. He, Effects of exposure- based reward and punishment on the evolution of coop- eration in prisoner’s dilemma game, Chaos, Solitons & Fractals172, 113519 (2023)

work page 2023

[23] [23]

T. A. Han, M. H. Duong, and M. Perc, Evolutionary mechanisms that promote cooperation may not promote social welfare, Journal of the Royal Society Interface21, 20240547 (2024)

work page 2024

[24] [24]

L. Zhou, B. Wu, J. Du, and L. Wang, Aspiration dynam- ics generate robust predictions in heterogeneous popula- tions, Nature Communications12, 3250 (2021)

work page 2021

[25] [25]

F. Chen, L. Zhou, and L. Wang, Cooperation among un- equal players with aspiration-driven learning, Journal of the Royal Society Interface21, 20230723 (2024)

work page 2024

[26] [26]

J. S. Weitz, C. Eksin, K. Paarporn, S. P. Brown, and W. C. Ratcliff, An oscillating tragedy of the commons in replicator dynamics with game-environment feedback, Proceedings of the National Academy of Sciences113, E7518 (2016)

work page 2016

[27] [27]

A. R. Tilman, J. B. Plotkin, and E. Ak¸ cay, Evolutionary games with environmental feedbacks, Nature communi- cations11, 915 (2020)

work page 2020

[28] [28]

Wang and F

X. Wang and F. Fu, Eco-evolutionary dynamics with en- vironmental feedback: Cooperation in a changing world, Europhysics Letters132, 10001 (2020)

work page 2020

[29] [29]

F. Fu, C. Hauert, M. A. Nowak, and L. Wang, Reputation-based partner choice promotes cooperation in social networks, Physical Review E78, 026117 (2008)

work page 2008

[30] [30]

F. P. Santos, F. C. Santos, and J. M. Pacheco, Social norm complexity and past reputations in the evolution of cooperation, Nature555, 242 (2018)

work page 2018

[31] [31]

C. Xia, J. Wang, M. Perc, and Z. Wang, Reputation and reciprocity, Physics of Life Reviews46, 8 (2023)

work page 2023

[32] [32]

Wang and C

J. Wang and C. Xia, Reputation evaluation and its im- pact on the human cooperation—a recent survey, Euro- physics Letters141, 21001 (2023)

work page 2023

[33] [33]

Ohtsuki and Y

H. Ohtsuki and Y. Iwasa, How should we define good- ness?—reputation dynamics in indirect reciprocity, Jour- nal of Theoretical Biology231, 107 (2004)

work page 2004

[34] [34]

Ohtsuki and Y

H. Ohtsuki and Y. Iwasa, The leading eight: social norms that can maintain cooperation by indirect reciprocity, Journal of theoretical biology239, 435 (2006)

work page 2006

[35] [35]

Hilbe, L

C. Hilbe, L. Schmid, J. Tkadlec, K. Chatterjee, and M. A. Nowak, Indirect reciprocity with private, noisy, and incomplete information, Proceedings of the National Academy of Sciences115, 12241 (2018)

work page 2018

[36] [36]

M. Wei, X. Wang, L. Liu, H. Zheng, Y. Jiang, Y. Hao, Z. Zheng, F. Fu, and S. Tang, Indirect reciprocity in the public goods game with collective reputations, Journal of the Royal Society Interface22, 20240827 (2025)

work page 2025

[37] [37]

M. A. Nowak and K. Sigmund, Evolution of indirect reci- procity by image scoring, Nature393, 573 (1998)

work page 1998

[38] [38]

M. A. Nowak and K. Sigmund, Evolution of indirect reci- procity, Nature437, 1291 (2005)

work page 2005

[39] [39]

W. Zhu, X. Wang, C. Wang, L. Liu, H. Zheng, and S. Tang, Reputation-based synergy and discount- ing mechanism promotes cooperation, New Journal of Physics26, 033046 (2024)

work page 2024

[40] [40]

J. J. Skowronski and D. E. Carlston, Negativity and ex- tremity biases in impression formation: A review of ex- planations, Psychological Bulletin105, 131 (1989)

work page 1989

[41] [41]

S. T. Fiske,Social beings: Core motives in social psychol- ogy(John Wiley & Sons, 2018)

work page 2018

[42] [42]

R. F. Baumeister, E. Bratslavsky, C. Finkenauer, and K. D. Vohs, Bad is stronger than good, Review of general psychology5, 323 (2001)

work page 2001

[43] [43]

I. S. Lim and N. Masuda, To trust or not to trust: Evolu- tionary dynamics of an asymmetric n-player trust game, IEEE Transactions on Evolutionary Computation28, 117 (2023)

work page 2023

[44] [44]

A. R. Fragale, B. Rosen, C. Xu, and I. Merideth, The higher they are, the harder they fall: The effects of wrongdoer status on observer punishment recommenda- tions and intentionality attributions, Organizational Be- havior and Human Decision Processes108, 53 (2009)

work page 2009

[45] [45]

Y. Dong, S. Sun, C. Xia, and M. Perc, Second-order rep- utation promotes cooperation in the spatial prisoner’s dilemma game, IEEE Access7, 82532 (2019)

work page 2019

[46] [46]

Q. Chen, X. Peng, H. Kang, Y. Shen, and X. Sun, The impact of historical-behavior-based asymmetric reputa- tion and deposit mechanisms on the evolutionary spatial public goods game, Chaos: An Interdisciplinary Journal of Nonlinear Science35, 10.1063/5.0293944 (2025)

work page doi:10.1063/5.0293944 2025

[47] [47]

Koster, M

R. Koster, M. Pˆ ıslar, A. Tacchetti, J. Balaguer, L. Liu, R. Elie, O. P. Hauser, K. Tuyls, M. Botvinick, and 12 C. Summerfield, Deep reinforcement learning can pro- mote sustainable human behaviour in a common-pool resource problem, Nature Communications16, 2824 (2025)

work page 2025

[48] [48]

K. R. McKee, A. Tacchetti, M. A. Bakker, J. Balaguer, L. Campbell-Gillingham, R. Everett, and M. Botvinick, Scaffolding cooperation in human groups with deep re- inforcement learning, Nature Human Behaviour7, 1787 (2023)

work page 2023

[49] [49]

L. Wang, D. Jia, L. Zhang, P. Zhu, M. Perc, L. Shi, and Z. Wang, L´ evy noise promotes cooperation in the pris- oner’s dilemma game with reinforcement learning, Non- linear Dynamics108, 1837 (2022)

work page 2022

[50] [50]

L. Fan, Z. Song, L. Wang, Y. Liu, and Z. Wang, Incorpo- rating social payoff into reinforcement learning promotes cooperation, Chaos: An Interdisciplinary Journal of Non- linear Science32, 10.1063/5.0093996 (2022)

work page doi:10.1063/5.0093996 2022

[51] [51]

Y. Geng, Y. Liu, Y. Lu, C. Shen, and L. Shi, Re- inforcement learning explains various conditional coop- eration, Applied Mathematics and Computation427, 127182 (2022)

work page 2022

[52] [52]

Y. Xu, J. Wang, J. Chen, D. Zhao, M. ¨Ozer, C. Xia, and M. Perc, Reinforcement learning and collective coopera- tion on higher-order networks, Knowledge-Based Systems 301, 112326 (2024)

work page 2024

[53] [53]

Mintz and F

B. Mintz and F. Fu, Evolutionary multi-agent rein- forcement learning in group social dilemmas, Chaos: An Interdisciplinary Journal of Nonlinear Science35, 10.1063/5.0246332 (2025)

work page doi:10.1063/5.0246332 2025

[54] [54]

Xie and A

K. Xie and A. Szolnoki, Reinforcement learning in evo- lutionary game theory: A brief review of recent devel- opments, Applied Mathematics and Computation510, 129685 (2026)

work page 2026

[55] [55]

Hou, Y.-S

Y. Hou, Y.-S. Ong, L. Feng, and J. M. Zurada, An evo- lutionary transfer reinforcement learning framework for multiagent systems, IEEE Transactions on Evolutionary Computation21, 601 (2017)

work page 2017

[56] [56]

Zou and C

K. Zou and C. Huang, Incorporating reputation into re- inforcement learning can promote cooperation on hyper- graphs, Chaos, Solitons & Fractals186, 115203 (2024)

work page 2024

[57] [57]

Ren and X.-J

T. Ren and X.-J. Zeng, Reputation-based interaction promotes cooperation with reinforcement learning, IEEE Transactions on Evolutionary Computation28, 1177 (2023)

work page 2023

[58] [58]

Xie and A

K. Xie and A. Szolnoki, Reputation in public goods coop- eration under double q-learning protocol, Chaos, Solitons & Fractals196, 116398 (2025)

work page 2025

[59] [59]

T. Ren, X. Yao, Y. Li, and X.-J. Zeng, Bottom-up reputation promotes cooperation with multi-agent re- inforcement learning, arXiv preprint arXiv:2502.01971 10.48550/arXiv.2502.01971 (2025)

work page doi:10.48550/arxiv.2502.01971 2025

[60] [60]

Y. Zhu, B. Xing, and C. Xia, Q-learning update with second-order reputation promotes the evolution of trust within structured populations, Chaos, Solitons & Frac- tals199, 116653 (2025)

work page 2025

[61] [61]

Zhang and X

Q. Zhang and X. Zhang, Q-learning driven cooperative evolution with dual-reputation incentive mechanisms, Applied Mathematics and Computation507, 129590 (2025)

work page 2025

[62] [62]

C. J. Watkins and P. Dayan, Q-learning, Machine Learn- ing8, 279 (1992)

work page 1992

[63] [63]

R. S. Sutton, A. G. Barto,et al.,Reinforcement learn- ing: an introduction, 2nd edn. Adaptive computation and machine learning, Vol. 1 (MIT press Cambridge, 2018)

work page 2018

[64] [64]

Tokic and G

M. Tokic and G. Palm, Value-difference based explo- ration: adaptive control between epsilon-greedy and softmax, inAnnual conference on artificial intelligence (Springer, 2011) pp. 335–346

work page 2011

[65] [65]

S. Shen, X. Zhang, A. Xu, and T. Duan, An adaptive exploration mechanism for q-learning in spatial public goods games, Chaos, Solitons & Fractals189, 115705 (2024)

work page 2024

[66] [66]

Milinski, D

M. Milinski, D. Semmann, and H.-J. Krambeck, Repu- tation helps solve the ‘tragedy of the commons’, Nature 415, 424 (2002)

work page 2002

[67] [67]

Fudenberg and D

D. Fudenberg and D. K. Levine, Maintaining a reputation when strategies are imperfectly observed, The Review of Economic Studies59, 561 (1992)

work page 1992

[68] [68]

M. A. Nowak and R. M. May, Evolutionary games and spatial chaos, nature359, 826 (1992)

work page 1992

[69] [69]

W. Zhu, Q. Pan, and M. He, Exposure-based reputa- tion mechanism promotes the evolution of cooperation, Chaos, Solitons & Fractals160, 112205 (2022)

work page 2022