The Reciprocity Gradient
Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3
The pith
The reciprocity gradient lets agents attribute influence by analytically backpropagating rewards through private estimators of opponents' policies trained on public observations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the influence attribution problem in communicative reciprocity can be solved by explicitly backpropagating reward gradients through private estimators of opponents' policies that are trained from public observations; this produces an analytical gradient that traverses the entire reputation chain rather than relying on sampled returns, enabling joint optimization of behavior and signaling.
What carries the argument
The reciprocity gradient, which routes reward signals analytically through private estimators of opponents' policies trained from public observations so that indirect effects along reputation chains are accounted for during policy updates.
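The routing described here can be sketched as a one-step toy chain rule. The tanh policy estimator, the linear reputation update, and the reward b * action below are illustrative assumptions, not the paper's actual functions:

```python
import math

# One-step toy version of the reciprocity-gradient routing (all functions
# are illustrative stand-ins, not the paper's actual machinery).

def estimated_opponent_policy(rep, w):
    # private estimator of the opponent's policy, trained elsewhere from
    # public observations; a tanh unit stands in for that trained model
    return math.tanh(w * rep)

def reputation_update(rep, signal, eta=0.5):
    # the agent's emitted signal shifts the publicly visible reputation
    return rep + eta * signal

def reciprocity_gradient(rep, signal, w, eta=0.5, b=2.0):
    """Analytical dReward/dSignal, routed through the estimator by the
    chain rule instead of being estimated from sampled returns:
        dR/ds = (dR/da) * (da/drep') * (drep'/ds)
    """
    rep_next = reputation_update(rep, signal, eta)
    a = estimated_opponent_policy(rep_next, w)
    dR_da = b                       # reward = b * opponent_action
    da_drep = w * (1.0 - a * a)     # derivative of tanh(w * rep')
    drep_ds = eta                   # derivative of the reputation update
    return dR_da * da_drep * drep_ds
```

A finite-difference check on the composed reward confirms the chain-rule product equals the true derivative, which is the sense in which the gradient is "analytical" rather than sampled.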
If this is right
- Agents can recover near-optimal context-sensitive policies in multi-agent games with communication.
- Actions and evaluative signals are optimized together without separate reward shaping.
- Sample-based credit assignment is replaced by lower-variance analytical propagation through estimated opponent models.
- The approach avoids the collapse to constant-output policies observed in baselines.
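The variance contrast in these bullets can be illustrated with a toy one-dimensional example; the Gaussian policy, linear reward, and all numbers below are assumptions for illustration, not taken from the paper. A score-function estimate of the same gradient is unbiased but noisy, while direct differentiation is exact:

```python
import random

# Toy illustration: estimate d/dtheta E[b * a] with a ~ N(theta, sigma^2)
# two ways, to show why analytical propagation has lower variance.
random.seed(0)
b, theta, sigma = 2.0, 0.5, 1.0
analytical_grad = b  # exact: d/dtheta of b * theta, zero variance

def sampled_grad():
    # one-sample score-function (REINFORCE-style) estimate
    a = random.gauss(theta, sigma)
    reward = b * a
    score = (a - theta) / sigma ** 2
    return reward * score

estimates = [sampled_grad() for _ in range(2000)]
mean = sum(estimates) / len(estimates)
var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
# the sampled estimator is unbiased (mean near 2.0) but high-variance;
# over long branching reputation chains this variance compounds
```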
Where Pith is reading between the lines
- The same estimator-plus-analytical-gradient pattern might apply to other indirect-influence settings such as long-horizon negotiation or repeated public-goods games.
- Accuracy of the private policy estimators becomes the practical bottleneck; better public-observation models could further improve performance.
- If the method generalizes beyond the tested domains, it could reduce the need for explicit reputation tracking mechanisms in deployed multi-agent systems.
Load-bearing premise
Private estimators of opponents' policies can be trained reliably from public observations alone, and the analytical gradient through the reputation chain captures indirect influences without large errors from model misspecification.
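The first half of this premise can be made concrete with a minimal sketch. The hidden linear opponent policy, the noise level, and the least-squares fit are all assumptions for illustration: a private estimator is fit purely from public (reputation, action) pairs.

```python
import random

# Minimal sketch of the premise: recover a hidden linear opponent policy
# a = w_true * reputation + noise using only publicly observable
# (reputation, action) pairs. Opponent and noise model are assumed.
random.seed(1)
w_true = 1.2

public_log = [(r, w_true * r + random.gauss(0.0, 0.05))
              for r in [i / 50 for i in range(1, 51)]]

# ordinary least squares through the origin: w_hat = sum(r*a) / sum(r*r)
num = sum(r * a for r, a in public_log)
den = sum(r * r for r, _ in public_log)
w_hat = num / den
# w_hat lands close to the hidden w_true despite observation noise
```

Whether this carries over to non-stationary or partially observable opponents is exactly what the load-bearing premise asserts and the referee questions.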
What would settle it
The claim would fail if, in the paper's experimental environments, the reciprocity gradient method produced policies with no higher returns and no greater behavioral variety than the sample-based baselines that collapse to constant outputs.
Original abstract
Communication is fundamental to sustaining reciprocity and cooperation in strategic interactions. We identify and formulate the influence attribution problem as the central optimization difficulty inherent in such dynamics for a learning agent: any action or signal the agent emits reshapes the reputations of many third parties along combinatorially branching paths before feeding back into its own future rewards, forcing the agent to account for all of these indirect channels at once when choosing every action. To address this, we introduce the reciprocity gradient, which explicitly backpropagates reward gradients through private estimators of opponents' policies trained from public observations. The gradient flows through the reputation chain itself analytically, rather than being estimated from sampled returns. It jointly optimizes actions and evaluative signals without intrinsic rewards or reward shaping. Empirically, the method recovers near-optimal context-sensitive policies, while sample-based baselines collapse into constant-output policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies the influence attribution problem in strategic interactions with reciprocity and communication, where actions reshape third-party reputations along branching paths before feeding back into rewards. It introduces the reciprocity gradient, which backpropagates reward gradients analytically through private estimators of opponents' policies trained from public observations, jointly optimizing actions and signals without intrinsic rewards. Empirically, the approach recovers near-optimal context-sensitive policies while sample-based baselines collapse to constant outputs.
Significance. If the core assumptions hold, the reciprocity gradient provides a novel analytical tool for handling indirect reputation effects in multi-agent learning, potentially improving upon standard RL methods that rely on sampled returns. The empirical contrast with baselines is a strength, demonstrating practical recovery of context-sensitive behavior in reputation dynamics.
major comments (2)
- [§4] Reciprocity gradient definition: The claim of exact analytical gradient flow through the reputation chain via the chain rule is load-bearing for the central contribution, yet the manuscript provides no derivation or bound showing that estimator error (from training private policy models on public observations alone) remains controlled under combinatorial branching of influence paths. If opponents' policies are non-stationary or depend on unrecoverable private state, the gradient becomes misspecified.
- [§5] Experiments: The reported recovery of near-optimal policies lacks quantitative metrics on private estimator accuracy (e.g., prediction error on held-out public observations) and any ablation on estimator quality, so it is impossible to isolate whether success stems from the analytical gradient or from other implementation details. This undermines the claim that sample-based methods fail specifically because they cannot capture indirect influences.
minor comments (2)
- [Abstract] The phrase 'the reputation chain itself analytically' is introduced without a one-sentence definition or a reference to the relevant equation; adding one would improve immediate clarity for readers unfamiliar with the influence attribution framing.
- [Introduction] Notation: The manuscript would benefit from an explicit equation for the reciprocity gradient (e.g., the backpropagation expression) placed in the introduction or early method section to anchor the subsequent claims.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments. We address each major point below and will revise the manuscript accordingly to strengthen the theoretical grounding and empirical support.
Point-by-point responses
- Referee: [§4] Reciprocity gradient definition: The claim of exact analytical gradient flow through the reputation chain via the chain rule is load-bearing for the central contribution, yet the manuscript provides no derivation or bound showing that estimator error (from training private policy models on public observations alone) remains controlled under combinatorial branching of influence paths. If opponents' policies are non-stationary or depend on unrecoverable private state, the gradient becomes misspecified.
Authors: We agree that the manuscript would benefit from an explicit derivation and error analysis. In the revision we will add a step-by-step derivation of the reciprocity gradient via the chain rule, showing the composition through the reputation-update functions and the differentiable policy estimators. We will also include a brief analysis of estimator error under the paper's maintained assumptions (public observations suffice to recover the policy parameters relevant to the agent's own rewards, and opponents' policies are stationary over the short estimation window used for each gradient step). Under these conditions the combinatorial branching does not produce uncontrolled error because each estimator is trained on the aggregate public signal; we will state explicitly that the gradient is exact only when these assumptions hold and becomes an approximation otherwise. This clarification will be added to §4 without altering the core claim. revision: yes
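One plausible shape of the derivation the authors promise, written with illustrative symbols not taken from the paper (signal parameters \(\eta_i\), public reputations \(\rho_t\), estimated opponent policies \(\hat{\pi}^j\)):

```latex
% Sketch: the agent's signal parameters \eta_i reach its reward R^i only
% through the public reputation \rho_t and the estimated opponent policy,
% so the gradient is a product of Jacobians along the reputation chain.
\nabla_{\eta_i} R^i
  = \sum_{t}
    \frac{\partial r^i_t}{\partial a^j_t}\,
    \frac{\partial \hat{\pi}^j(\rho_t)}{\partial \rho_t}\,
    \frac{\partial \rho_t}{\partial s^i_{t-1}}\,
    \frac{\partial s^i_{t-1}}{\partial \eta_i}
```

Each factor is exact only if \(\hat{\pi}^j\) matches the true opponent policy on the visited reputations, which is precisely where the referee's misspecification concern enters.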
- Referee: [§5] Experiments: The reported recovery of near-optimal policies lacks quantitative metrics on private estimator accuracy (e.g., prediction error on held-out public observations) and any ablation on estimator quality, so it is impossible to isolate whether success stems from the analytical gradient or from other implementation details. This undermines the claim that sample-based methods fail specifically because they cannot capture indirect influences.
Authors: We concur that additional quantitative diagnostics are needed. In the revised manuscript we will report the mean-squared prediction error of each private policy estimator on held-out public observations, together with an ablation that varies estimator quality (by subsampling the public data or injecting controlled noise). These results will be placed in §5 and will show that performance degrades gracefully with estimator error, thereby isolating the contribution of the analytical gradient. We will also strengthen the discussion of why sample-based baselines collapse, emphasizing the variance of Monte-Carlo estimates of the long, branching reputation chains versus the direct differentiation used by the reciprocity gradient. revision: yes
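The noise-injection ablation the authors propose could look like the following toy sketch. All functions are stand-ins for the paper's machinery: perturb the private estimator's weight and track the error this induces in the analytical gradient.

```python
import math

# Toy version of the proposed estimator-quality ablation: perturb the
# estimator weight w and measure the induced analytical-gradient error.

def analytic_grad(w, rep=0.3, signal=0.1, eta=0.5, b=2.0):
    # dReward/dSignal through a tanh policy estimator and a linear
    # reputation update (illustrative, not the paper's functions)
    a = math.tanh(w * (rep + eta * signal))
    return b * w * (1.0 - a * a) * eta

w_star = 1.5                 # stand-in for the well-trained estimator
g_star = analytic_grad(w_star)

errors = [abs(analytic_grad(w_star + noise) - g_star)
          for noise in (0.0, 0.1, 0.2, 0.4)]
# gradient error grows smoothly with estimator misspecification,
# the "graceful degradation" the rebuttal promises to demonstrate
```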
Circularity Check
No significant circularity detected in the reciprocity gradient derivation
full rationale
The paper introduces the reciprocity gradient as an explicit backpropagation of rewards through separately trained private policy estimators from public observations, contrasting it with sample-based return estimation. This is a standard differentiable RL construction (actor-critic style) rather than a self-definitional loop where the claimed near-optimal policies are forced by the inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked in the abstract or description. The empirical claim that the method recovers context-sensitive policies while baselines collapse is presented as an experimental outcome, not a tautological renaming or fitted prediction. The derivation chain remains self-contained against external benchmarks and does not reduce any result to its own fitted components by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Policy estimator parameters
axioms (2)
- Domain assumption: Reputation chains can be modeled through policy estimators.
- Domain assumption: Public observations suffice to train accurate private estimators of opponents' policies.