The Two Genie Game: Adoption and Welfare in Audit-Grounded AI Governance
Pith reviewed 2026-06-30 10:00 UTC · model grok-4.3
The pith
A harm-minimizing audited AI agent displaces approval-seeking agents when wisher attunement priors are monotone, endpoint-inverting, and centro-symmetrically paired.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the two-genie game, an agent equipped with a community ledger and a harm-minimizing audit displaces an approval-seeking agent when the prior distributions on wisher attunement are monotone, exhibit endpoint inversion, and possess centro-symmetric pairing (Theorems 5.4 and 5.5), as demonstrated for Hill, Pareto, Lomax, and Fréchet priors. Fixation occurs above a critical adoption threshold, provided the effective community size is small enough for fixation to complete before the resource pool is depleted. The same policy does not guarantee harm prevention once dominant: the dominant state is absorbing, so misalignment makes it welfare-negative, and even under alignment it locks in harm that is evaluated only after the adoption horizon.
What carries the argument
A finite-population Moran-Fermi pairwise comparison process, analyzed under the assumptions of wisher hindsight, peer testimony, a monotone harm ledger, and a finite, depleting resource pool in a negative-sum setting.
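For orientation, the standard textbook form of a pairwise comparison (Fermi) process is sketched below. It is not a transcription of the paper's own equations. The symbols are assumptions introduced here: j is the number of audited agents among N community members, \beta is the selection intensity, and \pi_A(j), \pi_B(j) are the payoffs of the audited and approval-seeking agents.

```latex
% Probability that the number of audited agents moves from j to j+1 (T^+) or j-1 (T^-)
T^{\pm}(j) = \frac{j}{N}\,\frac{N-j}{N}\,
             \frac{1}{1 + e^{\mp\beta\,[\pi_A(j) - \pi_B(j)]}},
\qquad
\frac{T^{-}(j)}{T^{+}(j)} = e^{-\beta\,[\pi_A(j) - \pi_B(j)]}

% Probability that the audited agent reaches fixation starting from i adopters
\phi_i = \frac{\displaystyle\sum_{k=0}^{i-1}\prod_{j=1}^{k}\frac{T^{-}(j)}{T^{+}(j)}}
              {\displaystyle\sum_{k=0}^{N-1}\prod_{j=1}^{k}\frac{T^{-}(j)}{T^{+}(j)}}
```

In this generic form, a payoff gap \pi_A(j) - \pi_B(j) that changes sign from negative to positive as j grows produces the kind of threshold behavior the summary describes: \phi_i stays small below the sign change and approaches 1 above it.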
If this is right
- Above the critical adoption level, fixation of the audited agent becomes the overwhelmingly probable outcome.
- Fixation is attainable only when the effective informational community size is small enough for the process to complete before resources are exhausted.
- Once the audited agent reaches dominance the state is absorbing, independent of alignment.
- Under misalignment the dominant policy becomes welfare-negative; even under alignment it locks in harm evaluated after the adoption horizon.
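The first implication above, a critical adoption level beyond which fixation is overwhelmingly likely, can be illustrated with a minimal sketch. It evaluates the standard closed-form fixation probability of a Fermi pairwise comparison process. The payoff gap `toy_gap`, its critical share of 0.3, and the selection intensity `beta` are hypothetical placeholders and are not taken from the paper, which derives its payoffs from the attunement priors.

```python
import math


def fixation_probabilities(N, payoff_gap, beta=1.0):
    """Return phi, where phi[i] is the fixation probability from i adopters.

    payoff_gap(j) is a hypothetical payoff advantage of the audited agent
    over the approval-seeking agent when j of N members have adopted it.
    """
    # Under the Fermi rule, T^-(j) / T^+(j) = exp(-beta * payoff_gap(j)).
    # Accumulate the log of each product so long products cannot overflow.
    log_terms = [0.0]  # k = 0 corresponds to the empty product
    running = 0.0
    for j in range(1, N):
        running -= beta * payoff_gap(j)
        log_terms.append(running)

    shift = max(log_terms)  # rescale before exponentiating
    cumulative = []
    partial = 0.0
    for term in log_terms:
        partial += math.exp(term - shift)
        cumulative.append(partial)

    total = cumulative[-1]
    return [0.0] + [c / total for c in cumulative]


N = 100


def toy_gap(j):
    # Assumption: the audited agent is disadvantaged below a 30% share
    # and advantaged above it.
    return 2.0 * (j / N - 0.3)


phi = fixation_probabilities(N, toy_gap)
for i in (10, 30, 50):
    print(i, round(phi[i], 4))
```

With this toy gap, the fixation probability is close to zero well below the 30% share and close to one well above it. Raising `beta` sharpens the transition.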
Where Pith is reading between the lines
- The absorbing nature of dominance implies that early adoption thresholds may determine long-term governance outcomes more than later corrective mechanisms.
- The same ledger that supports short-term harm reduction can become a trap if evaluation windows extend beyond the adoption phase.
- Extensions to multi-agent or continuous-resource settings would test whether the three prior properties remain sufficient when depletion is stochastic rather than deterministic.
Load-bearing premise
The distributions describing how readily wishers attune to community sentiment must be monotone, display endpoint inversion, and satisfy centro-symmetric pairing.
What would settle it
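A direct simulation or market observation in which the attunement priors violate centro-symmetric pairing yet the audited agent still reaches fixation would show that the three properties are not necessary for adoption. Because the abstract states them as conditions under which adoption is favored, the sharper test of Theorems 5.4 and 5.5 is the converse: priors that satisfy all three properties, with N_c inside the stated bound and adoption above the critical level, for which the audited agent nonetheless fails to fix.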
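A direct simulation of this kind could be organized along the lines of the sketch below. It runs a Fermi pairwise comparison process alongside a finite resource pool and tallies whether each trajectory ends in fixation, reversion, or depletion. Several parts are assumptions made for illustration only: the one-unit-per-step depletion rule, the placeholder `toy_gap`, and all parameter values. A real test would replace `toy_gap` with a payoff gap derived from the attunement priors under study.

```python
import math
import random
from collections import Counter


def simulate(N, i0, payoff_gap, beta, pool, rng):
    """Run one trajectory; return 'fixation', 'reversion', or 'depleted'."""
    j = i0  # current number of members using the audited agent
    while 0 < j < N:
        if pool <= 0:
            return "depleted"
        pool -= 1  # assumption: every comparison step consumes one unit

        # Draw a focal member and a role model uniformly at random.
        focal_is_audited = rng.random() < j / N
        model_is_audited = rng.random() < j / N
        if focal_is_audited == model_is_audited:
            continue  # same type, nothing to imitate

        gap = payoff_gap(j)  # audited payoff minus approval-seeking payoff
        if model_is_audited:
            # Focal member switches to the audited agent.
            if rng.random() < 1.0 / (1.0 + math.exp(-beta * gap)):
                j += 1
        else:
            # Focal member switches back to the approval-seeking agent.
            if rng.random() < 1.0 / (1.0 + math.exp(beta * gap)):
                j -= 1
    return "fixation" if j == N else "reversion"


def run_batch(N, i0, payoff_gap, beta, pool, runs, seed=0):
    rng = random.Random(seed)
    return Counter(
        simulate(N, i0, payoff_gap, beta, pool, rng) for _ in range(runs)
    )


N = 50


def toy_gap(j):
    # Hypothetical gap with a sign change at a 30% share.
    return 2.0 * (j / N - 0.3)


ample = run_batch(N, 40, toy_gap, beta=1.0, pool=100_000, runs=200)
scarce = run_batch(N, 25, toy_gap, beta=1.0, pool=5, runs=200)
print("ample pool:", dict(ample))
print("scarce pool:", dict(scarce))
```

The two batches differ only in starting point and pool size. The scarce batch cannot reach either boundary within five steps, so it isolates the depletion outcome from the adoption dynamics.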
Original abstract
We ask under what conditions an agent with a harm-minimizing policy can displace an approval-seeking (RLHF) agent in a competitive market, and when that policy is sufficient to prevent community harm. We use evolutionary game theory (finite-population Moran-Fermi pairwise comparison) to formalize this subject to assumptions of wisher hindsight, peer testimony, a monotone harm ledger, sufficient information density of community feedback, and a finite, depleting resource pool, in a negative-sum environment. We show that adoption is favored when the prior distributions on how readily wishers attune to community sentiment are monotone, exhibit endpoint inversion, and have a centro-symmetric pairing property, and demonstrate this with several long-tailed priors (Hill, Pareto, Lomax, Frechet). Where it is favored, a critical adoption level separates communities that drift back to the approval-seeking agent from those for which the audited agent fixes; above that level fixation is the overwhelmingly likely outcome. We derive when fixation is attainable as a bound on the effective (informational) size N_c of the community, which must be small enough to allow fixation before depletion. We present these as Theorems 5.4 and 5.5; the algebraic and finite-grid backbone is machine-checked in Lean 4, with the barrier-crossing asymptotics retained as explicit hypotheses. We show that a self-audited agent with a community ledger is not, in general, sufficient to prevent community harm. Sufficiency depends both upon the alignment of the agent's audit with community values and the timeframe over which harm is evaluated. Regardless of alignment, once adoption reaches dominance, the state is absorbing. The same policy that reduced harm under alignment becomes a trap, welfare-negative under misalignment and, even under alignment, one that locks in harm deferred past the adoption horizon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper models the displacement of an approval-seeking RLHF agent by a harm-minimizing self-audited agent in a competitive market, using finite-population Moran-Fermi evolutionary game theory. Under assumptions of wisher hindsight, peer testimony, a monotone harm ledger, sufficient community feedback density, and a finite depleting resource pool in a negative-sum setting, it claims that adoption is favored for priors that are monotone, exhibit endpoint inversion, and have centro-symmetric pairing (demonstrated on Hill, Pareto, Lomax, and Fréchet distributions). A critical adoption level separates drift-back from fixation regimes, with fixation attainable only when the effective community size N_c is small enough to allow absorption before depletion (Theorems 5.4 and 5.5). The algebraic and finite-grid backbone is machine-checked in Lean 4, but the barrier-crossing asymptotics remain explicit hypotheses. The paper further claims that a self-audited agent with a community ledger is not generally sufficient to prevent harm, because sufficiency depends on the audit's alignment with community values and on the harm-evaluation timeframe; once the agent is dominant, the state is absorbing and can lock in deferred harm.
Significance. If the central claims hold, the work supplies a formal evolutionary-game-theoretic account of when audit-grounded governance policies can achieve market fixation versus reversion, together with explicit conditions under which they fail to bound community harm. The partial machine-checked verification of the algebraic backbone is a clear methodological strength that reduces circularity risk for the verified portions.
major comments (2)
- [Theorems 5.4 and 5.5] Theorems 5.4 and 5.5: the separation into drift-back versus absorbing-fixation regimes and the derived bound on effective community size N_c rest on barrier-crossing asymptotics for the stated long-tailed priors under the monotone-harm-ledger and wisher-hindsight assumptions; these asymptotics are retained as explicit hypotheses rather than machine-checked, so the fixation claims are conditional on unverified analytic steps.
- [Abstract, Theorems 5.4 and 5.5] Abstract and §5: the claim that fixation is the 'overwhelmingly likely outcome' above the critical adoption level is load-bearing for the welfare conclusions, yet the Moran-Fermi process analysis invokes a finite depleting resource pool whose interaction with the barrier-crossing time scale is not shown to preserve the stated separation once hindsight and monotone-ledger constraints are imposed.
minor comments (2)
- [Theorems 5.4 and 5.5] Notation for the effective informational community size N_c should be introduced with an explicit definition before its use in the N_c bound statement.
- [Abstract] The abstract lists five modeling assumptions (wisher hindsight, peer testimony, monotone harm ledger, information density, finite depleting pool) but does not indicate which are used only for the fixation theorems versus the harm-sufficiency claim.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. We respond point-by-point to the major comments below.
- Referee: Theorems 5.4 and 5.5: the separation into drift-back versus absorbing-fixation regimes and the derived bound on effective community size N_c rest on barrier-crossing asymptotics for the stated long-tailed priors under the monotone-harm-ledger and wisher-hindsight assumptions; these asymptotics are retained as explicit hypotheses rather than machine-checked, so the fixation claims are conditional on unverified analytic steps.
  Authors: The manuscript explicitly states that the algebraic and finite-grid backbone is machine-checked in Lean 4 while the barrier-crossing asymptotics are retained as explicit hypotheses. This accurately reflects the scope of the verified results and makes the conditional nature of the fixation claims transparent. Extending formal verification to the asymptotic steps for the long-tailed priors would require substantial additional work outside the present scope.
  revision: no
- Referee: Abstract and §5: the claim that fixation is the 'overwhelmingly likely outcome' above the critical adoption level is load-bearing for the welfare conclusions, yet the Moran-Fermi process analysis invokes a finite depleting resource pool whose interaction with the barrier-crossing time scale is not shown to preserve the stated separation once hindsight and monotone-ledger constraints are imposed.
  Authors: Theorems 5.4 and 5.5 derive the bound on N_c precisely so that absorption occurs before depletion under the wisher-hindsight and monotone-harm-ledger assumptions. The regime separation follows from the barrier-crossing analysis with this bound. We agree that an expanded discussion of the time-scale interaction would improve clarity and will revise §5 accordingly.
  revision: partial
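The time-scale interaction discussed in this exchange can be made concrete with a rough sketch that computes the exact mean absorption time of a Fermi pairwise comparison chain and compares it with a depletion horizon. This is a generic birth-death calculation, not the paper's N_c bound. The payoff gap from `make_toy_gap`, the horizon `POOL`, and the assumption that one resource unit is consumed per step are all hypothetical.

```python
import math


def mean_absorption_times(N, payoff_gap, beta):
    """Return t, where t[j] is the mean number of steps to reach 0 or N from j."""
    up, down = [], []
    for j in range(1, N):
        pair = (j / N) * ((N - j) / N)  # chance of drawing a mixed pair
        gap = payoff_gap(j)
        up.append(pair / (1.0 + math.exp(-beta * gap)))
        down.append(pair / (1.0 + math.exp(beta * gap)))

    # Solve (up + down) t_j - up t_{j+1} - down t_{j-1} = 1 with t_0 = t_N = 0
    # using the Thomas algorithm for tridiagonal systems.
    n = N - 1
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    diag = up[0] + down[0]
    c_prime[0] = -up[0] / diag
    d_prime[0] = 1.0 / diag
    for k in range(1, n):
        sub = -down[k]
        denom = (up[k] + down[k]) - sub * c_prime[k - 1]
        c_prime[k] = -up[k] / denom
        d_prime[k] = (1.0 - sub * d_prime[k - 1]) / denom

    t = [0.0] * n
    t[-1] = d_prime[-1]
    for k in range(n - 2, -1, -1):
        t[k] = d_prime[k] - c_prime[k] * t[k + 1]
    return [0.0] + t + [0.0]


def make_toy_gap(N, critical_share=0.3, scale=2.0):
    # Hypothetical payoff gap with a sign change at critical_share.
    return lambda j: scale * (j / N - critical_share)


POOL = 2000  # hypothetical depletion horizon, one unit per step
for N in (20, 50, 200):
    t = mean_absorption_times(N, make_toy_gap(N), beta=1.0)
    start = N // 2
    print(N, round(t[start], 1), "within pool" if t[start] <= POOL else "exceeds pool")
```

The mean absorption time grows with community size in this toy setting, which is the qualitative reason a size bound is needed for fixation to complete before depletion. A mean is only a first check, since a trajectory can still exceed the horizon when the absorption-time distribution has a long tail.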
Circularity Check
No significant circularity; central claims follow from standard Moran-Fermi dynamics under explicit hypotheses with Lean-verified algebra
The paper formalizes adoption and fixation via the finite-population Moran-Fermi pairwise comparison process under stated assumptions (wisher hindsight, monotone harm ledger, etc.). Theorems 5.4 and 5.5 derive the critical adoption level and N_c bound directly from the model equations; the algebraic and finite-grid backbone is machine-checked in Lean 4 while barrier-crossing asymptotics are retained as explicit hypotheses rather than smuggled in. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The results are therefore self-contained against the model inputs and do not reduce by construction to the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (6)
- domain assumption: wisher hindsight
- domain assumption: peer testimony
- domain assumption: monotone harm ledger
- domain assumption: sufficient information density of community feedback
- domain assumption: finite, depleting resource pool
- domain assumption: negative-sum environment