Reinforcement learning with reputation-based adaptive exploration promotes the evolution of cooperation
Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3
The pith
Coupling exploration to local reputation in Q-learning promotes cooperation in evolutionary games.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Each mechanism independently promotes cooperation, and their combination yields a reinforcing effect. The joint mechanism enhances cooperation by making high-reputation agents explore less and low-reputation agents explore more, while adjusting reputation updates to amplify cooperative gains at low status and defection penalties at high status.
What carries the argument
Q-learning with exploration rates tied to local reputation differences, combined with asymmetric, state-dependent reputation updates.
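The coupling between reputation and exploration can be sketched as follows, using the exploration formula quoted further down this page, ε_i(t) = ε₀ / (1 + tanh[η (R_i(t) − R̄_Ωi(t)) / (R_max − R_min)]). This is a minimal sketch, not the authors' code. The default values of `eps0`, `eta`, `r_min` and `r_max`, the reading of R̄_Ωi as a plain mean over the neighbours, and the clamp to 1.0 are assumptions made for illustration.

```python
import math
import random


def exploration_rate(rep_i, neighbor_reps, eps0=0.1, eta=1.0,
                     r_min=0.0, r_max=100.0):
    """Reputation-coupled exploration rate for one agent.

    rep_i         -- the agent's own reputation R_i
    neighbor_reps -- reputations in the agent's neighbourhood (assumed: plain mean)
    eps0, eta     -- baseline rate and sensitivity (hypothetical defaults)
    """
    mean_rep = sum(neighbor_reps) / len(neighbor_reps)
    gap = (rep_i - mean_rep) / (r_max - r_min)  # normalised reputation gap
    eps = eps0 / (1.0 + math.tanh(eta * gap))
    # Assumption: clamp so the result stays a valid probability.
    return min(eps, 1.0)


def choose_action(q_row, eps, rng=random):
    """Epsilon-greedy choice over the Q-values of the current state."""
    if rng.random() < eps:
        return rng.randrange(len(q_row))  # explore: uniform random action
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploit
```

With these definitions an agent whose reputation sits above the local mean gets a rate below `eps0`, and an agent below the local mean gets a rate above it, which is the "high reputation, low exploration" pattern described in the core claim.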
If this is right
- Cooperation levels rise further when the two mechanisms operate together than when either is used alone.
- High-reputation agents become more likely to exploit known cooperative strategies while low-reputation agents continue to sample alternatives.
- Reputation gains from cooperation are larger when an agent has low status, and reputation penalties for defection are larger when an agent has high status.
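One way to picture the asymmetric, state-dependent update is the sketch below. This page only says that the rule involves a parameter δ and a threshold A, so the exact functional form here is hypothetical: a unit step, an extra δ added to cooperative gains below the threshold and to defection penalties at or above it, and clipping to a fixed reputation range. All default values are illustrative assumptions.

```python
def update_reputation(rep, cooperated, delta=1.0, threshold=50.0,
                      r_min=0.0, r_max=100.0, step=1.0):
    """Hypothetical asymmetric reputation update.

    Below the threshold, cooperation earns an amplified gain (step + delta).
    At or above the threshold, defection costs an amplified penalty.
    Otherwise the change is the plain step.
    """
    if cooperated:
        change = (step + delta) if rep < threshold else step
    else:
        change = -(step + delta) if rep >= threshold else -step
    # Assumption: reputation is clipped to [r_min, r_max].
    return max(r_min, min(r_max, rep + change))
```

Setting `delta=0.0` recovers a symmetric update, which is the baseline a robustness check would compare against.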
Where Pith is reading between the lines
- The same coupling could be tested in settings where reputation is observed with noise or delay.
- The approach suggests a route for designing multi-agent systems in which status signals naturally reduce wasteful exploration once good strategies are found.
- Real-world reputation platforms might be examined to see whether they produce analogous exploration patterns among users.
Load-bearing premise
Agents can accurately perceive and respond to differences in local reputation when they adjust their exploration rates.
What would settle it
Simulations that keep exploration fixed and use symmetric reputation updates should show no comparable rise in cooperation levels; if they do show one, the claim fails.
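A comparison of this kind amounts to a two-by-two ablation: baseline, adaptive exploration only, asymmetric reputation only, and both. The sketch below shows how the outcomes of such runs could be summarised. The function names and the choice of final cooperation fraction per run as the input are assumptions. The interaction contrast tests super-additivity, which is a stricter reading of "reinforcing" than the page itself states.

```python
from statistics import mean


def exceeds_either(explore_only, reputation_only, both):
    """True if the combined condition beats each single-mechanism condition."""
    return mean(both) > max(mean(explore_only), mean(reputation_only))


def interaction_contrast(base, explore_only, reputation_only, both):
    """Gain of the combined condition minus the sum of the individual gains.

    Each argument is a list of cooperation fractions, one per independent run.
    A positive value means the joint gain is larger than the two separate
    gains added together (super-additive); zero means purely additive.
    """
    gain_explore = mean(explore_only) - mean(base)
    gain_reputation = mean(reputation_only) - mean(base)
    gain_both = mean(both) - mean(base)
    return gain_both - (gain_explore + gain_reputation)
```

Both functions take per-run results, so they can be applied unchanged to whatever number of independent runs a simulation produces.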
Original abstract
Multi-agent reinforcement learning serves as an effective tool for studying strategy adaptation in evolutionary games. Although prior work has integrated Q-learning with reputation mechanisms to promote cooperation, most existing algorithms adopt fixed exploration rates and overlook the influence of social context on exploratory behavior. In practice, individuals may adjust their willingness to explore based on their reputation and perceived social standing. To address this, we propose a Q-learning model that couples exploration rates with local reputation differences and incorporates asymmetric, state-dependent reputation updates. Our results show that each mechanism independently promotes cooperation, and their combination yields a reinforcing effect. The joint mechanism enhances cooperation by making "high reputation–low exploration, low reputation–high exploration", while adjusting reputation updates to amplify cooperative gains at low status and defection penalties at high status. This study thus offers insights into how social evaluation can shape learning behavior in complex environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Q-learning model in multi-agent reinforcement learning for evolutionary games. It couples exploration rates to local reputation differences and incorporates asymmetric, state-dependent reputation updates. Simulations are claimed to show that each mechanism independently promotes cooperation and that their combination produces a reinforcing effect through the mapping of high reputation to low exploration (and vice versa), together with status-dependent amplification of cooperative gains at low status and defection penalties at high status.
Significance. If the simulation results are robust, the work would provide a concrete demonstration of how social-evaluation mechanisms can shape exploratory behavior in RL agents and thereby influence the evolution of cooperation. The combination of adaptive exploration and asymmetric reputation updates is a novel modeling choice that could inform both evolutionary game theory and multi-agent RL design.
major comments (2)
- [Results / Model definition] The central claim of a reinforcing (synergistic) effect between the two mechanisms rests on the specific asymmetric reputation-update rule. The manuscript should demonstrate that the reported synergy survives under alternative functional forms (e.g., symmetric updates or reversed asymmetry); otherwise the headline result may be an artifact of an untested modeling choice rather than a general consequence of coupling reputation to exploration.
- [Methods / Simulation setup] The abstract and methods description provide no information on simulation parameters (population size, payoff matrix, learning rates, number of independent runs), statistical tests, baseline comparisons, or error bars. Without these details the data support for the independent and reinforcing effects cannot be evaluated.
minor comments (1)
- [Model] Clarify the precise evolutionary game (e.g., Prisoner's Dilemma parameters) and the exact functional form of the reputation-update rule in the main text rather than only in supplementary material.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our work's potential significance and for the constructive major comments. We address each point below and have revised the manuscript accordingly to strengthen the presentation and robustness of the results.
- Referee: [Results / Model definition] The central claim of a reinforcing (synergistic) effect between the two mechanisms rests on the specific asymmetric reputation-update rule. The manuscript should demonstrate that the reported synergy survives under alternative functional forms (e.g., symmetric updates or reversed asymmetry); otherwise the headline result may be an artifact of an untested modeling choice rather than a general consequence of coupling reputation to exploration.
  Authors: We agree that the headline claim of a reinforcing effect would be more general if shown to be robust to the precise form of the reputation update. The asymmetry in our model is motivated by the social intuition that reputation gains from cooperation are more salient when an agent has low status, while defection penalties are amplified at high status. Nevertheless, to address the concern that the synergy might be an artifact of this choice, we have performed additional simulations with both symmetric updates and reversed asymmetry. These results will be added to the revised manuscript (new figure and accompanying text) to demonstrate that the reinforcing interaction between adaptive exploration and reputation persists, albeit with quantitative differences in the level of cooperation achieved.
  Revision: yes
- Referee: [Methods / Simulation setup] The abstract and methods description provide no information on simulation parameters (population size, payoff matrix, learning rates, number of independent runs), statistical tests, baseline comparisons, or error bars. Without these details the data support for the independent and reinforcing effects cannot be evaluated.
  Authors: We thank the referee for noting this presentational gap. Although the simulation protocol is described in the main text, we acknowledge that the abstract and the opening of the Methods section did not list the parameters explicitly. In the revised manuscript we have added a dedicated parameter table (population size N=1000, Prisoner's Dilemma payoffs with benefit-to-cost ratio b/c=1.5, learning rate α=0.1, discount factor γ=0.9, 50 independent runs per condition) and have included error bars together with two-sided t-tests or Wilcoxon tests on all key comparisons. Baseline results for standard Q-learning with fixed ε-greedy exploration are already present but are now referenced more explicitly in the main figures.
  Revision: yes
Circularity Check
No circularity; simulation outcomes are independent of any self-referential derivation.
The paper introduces a Q-learning agent model that couples exploration rates to local reputation differences and applies asymmetric state-dependent reputation updates, then reports cooperation levels from multi-agent simulations. No equations or claims reduce by construction to fitted inputs, self-citations, or renamed empirical patterns; the reported reinforcing effect is an observed numerical outcome under the stated rules rather than a tautological restatement of the model definition itself. The derivation chain consists of standard RL updates plus explicitly chosen functional forms for reputation and exploration, none of which are justified solely by prior work from the same authors or by re-labeling known results.
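The "standard RL updates" mentioned above refer to the tabular Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]. A minimal sketch follows. The defaults α = 0.1 and γ = 0.9 are the values quoted in the simulated rebuttal. The nested-list Q-table and the idea of indexing states by an integer (for example, the agent's previous action) are assumptions, since this page does not give the paper's state definition.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    q is a list of rows, one per state; each row holds one value per action.
    The state encoding is an assumption made for this sketch.
    """
    best_next = max(q[next_state])  # greedy value of the successor state
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]
```

Because the update itself contains no reference to reputation, any cooperation effect in such a model has to enter through the reward signal and the exploration rate, which is consistent with the non-circularity verdict above.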
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: ε_i(t) = ε₀ / (1 + tanh[η (R_i(t) − R̄_Ωi(t)) / (R_max − R_min)])
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Reputation update rule with δ (asymmetric, state-dependent on threshold A)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Fitness f_i = (1−θ)P_i + θ·scaled R_i; Q-learning on lattice PDG
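The fitness expression quoted in the last passage, f_i = (1−θ)P_i + θ·scaled R_i, blends game payoff with reputation. The sketch below illustrates it under stated assumptions: the payoff is computed as a donation game with benefit `b` and cost `c` (the rebuttal quotes b/c = 1.5 but not the game form), and "scaled" is read as a linear rescaling of reputation to [0, 1]. Neither choice is confirmed by this page, and `theta=0.5` is an arbitrary default.

```python
def donation_payoff(own_coop, neighbor_coops, b=1.5, c=1.0):
    """Accumulated payoff against all neighbours (assumed donation game).

    A cooperator pays c per neighbour; every cooperating neighbour gives b.
    """
    received = b * sum(1 for coop in neighbor_coops if coop)
    paid = c * len(neighbor_coops) if own_coop else 0.0
    return received - paid


def fitness(payoff, rep, theta=0.5, r_min=0.0, r_max=100.0):
    """Weighted mix of payoff and reputation; theta = 0 ignores reputation."""
    scaled_rep = (rep - r_min) / (r_max - r_min)  # assumed min-max scaling
    return (1.0 - theta) * payoff + theta * scaled_rep
```

On a square lattice with four neighbours, `neighbor_coops` would hold the four neighbours' current actions.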
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.