Emergence of agriculture in an artificial society of reinforcement learning agents

Clément Moulin-Frier; Gautier Hamon; Martí Sánchez-Fibla; Ricard Solé

arxiv: 2605.22256 · v2 · pith:BSKKLGZQnew · submitted 2026-05-21 · 💻 cs.MA

Pith reviewed 2026-05-22 02:22 UTC · model grok-4.3

classification 💻 cs.MA

keywords agriculture emergence · reinforcement learning agents · artificial society · social learning · cheater vulnerability · lock-in effect · ecological feedback · evolutionary transitions

The pith

Agricultural practices emerge spontaneously in artificial societies of reinforcement learning agents through coupled learning and environmental modification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how farming practices can develop on their own in a group of learning agents inside a changing environment, without any built-in rules for agriculture. Agents learn to value rewards that arrive later in time, which encourages investment in modifying the environment, yet this opens them to exploitation by others who contribute nothing. Social learning spreads effective strategies while limiting the spread of cheaters, and once established the new practices become hard to reverse, supporting larger populations and more resources. A sympathetic reader would care because this setup links simple individual choices to large-scale collective change, offering a way to understand real historical shifts like the origins of agriculture as outcomes of basic mechanisms rather than special events.

Core claim

Within this artificial society, agricultural practices emerge spontaneously without explicit instruction through the coupled dynamics of learning and environmental modification. This transition is governed by four key ingredients: individual planning through the valuation of delayed rewards, social vulnerability to cheaters, stabilization via social learning, and an emergent lock-in effect that renders agriculture effectively irreversible once established. Social learning acts as a firewall that suppresses cheater invasion and enables the propagation of successful strategies, leading to sustained population growth and nonlinear amplification of domesticated resources.

What carries the argument

The coupled dynamics of reinforcement learning and environmental modification in a multi-agent system, driven by the four ingredients of delayed reward valuation, cheater vulnerability, social learning stabilization, and lock-in.
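
To make the first ingredient concrete, here is a minimal sketch of how a discount factor can flip an agent's preference between a small immediate reward and a larger delayed one. Only the role of the discount rate γ comes from the paper. The reward values, the 10-step horizon, and the function names are hypothetical and chosen purely for illustration.

```python
def discounted_return(rewards, gamma):
    """Standard discounted return: sum of rewards[t] * gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# HYPOTHETICAL reward streams over 10 steps (illustrative numbers, not from the paper).
FORAGE_NOW = [1.0] + [0.0] * 9          # small reward at step 0, nothing after
TEND_THEN_HARVEST = [0.0] * 9 + [5.0]   # nothing for 9 steps, larger reward at step 9


def prefers_investment(gamma):
    """True if the delayed stream is worth more than the immediate one under gamma."""
    delayed = discounted_return(TEND_THEN_HARVEST, gamma)
    immediate = discounted_return(FORAGE_NOW, gamma)
    return delayed > immediate


if __name__ == "__main__":
    for gamma in (0.5, 0.8, 0.9, 0.99):
        print(gamma, prefers_investment(gamma))
```

With these made-up numbers the delayed option wins only when 5·γ^9 > 1, that is, for γ above roughly 0.84. The threshold itself is an artifact of the chosen values; the point is only that a longer effective time horizon makes investment in the environment worthwhile.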

If this is right

Agents learn strategies involving long-term investment in domesticating resources through delayed reward valuation.
Social learning blocks cheater invasion while spreading successful practices across the population.
An emergent lock-in effect makes reversal to non-agricultural states unlikely once the transition occurs.
Sustained population growth follows from the nonlinear increase in domesticated resources.
Individual planning and social interactions combine to produce irreversible collective ecological changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same interplay of delayed planning and social copying could be tested in models of other cultural shifts such as the rise of trade or tool specialization.
Adjusting the environment's resource renewal rates or agent numbers would reveal how sensitive the lock-in effect is to external conditions.
This computational approach could generate testable predictions for archaeological patterns in the timing and spread of early farming.
Extending the agents to include memory of past interactions might show whether stronger social ties accelerate or alter the transition.

Load-bearing premise

The chosen reinforcement learning architecture and ecological rules in the simulation capture the essential mechanisms behind real-world agricultural origins rather than resulting from specific parameter or design choices.

What would settle it

Running repeated simulations without the social learning component and finding that cheater strategies dominate while agricultural practices fail to stabilize or spread.

Figures

Figures reproduced from arXiv: 2605.22256 by Clément Moulin-Frier, Gautier Hamon, Martí Sánchez-Fibla, Ricard Solé.

**Figure 1.** Modelling agent-environment interactions. (a) Each agent agk (k = 1, 2, ..., N) starts with a random behavior, interacting within a spatial landscape with three different kinds of plants (Pi) and water (W). One of them (P1) is the most rewarding plant (e.g. because it provides more nutrients to the agent), but is not common and is ecologically outcompeted (b-c) by another plant (P2), which provides no reward…

**Figure 2.** Parameter-dependent emergence of agriculture. In (a) we display four characteristic measures and their values for our MARL simulation model. Two key parameters have been used here: the rate at which the wild plant P3 grows and the "cognitive" dimension of agents as captured by the discount rate γ, which defines the learning time horizon. In each parameter space we show the impact of each parameter combinat…

**Figure 3.** Scales, transitions and lock-in states. (a) Over 10^6 episodes, N = 4 agents learn an effective strategy of ecological engineering by removing the competitive P2 weed, reducing its density around the rewarding P1 plant. (b) Within-episode dynamics of P1 for untrained agents (left) and fully trained agents (right). The right plot also shows P1 dynamics when no agent is present. Seasonal oscillations are visi…

**Figure 4.** Population dynamics growth under social cloning. (a) Heat maps showing the outcome of simulations as a function of the wild plant growth rate P3 and the number of agents. The panels report the abundance of the domesticated plant P1, watering activity, foraging on the wild plant P3, and removal of the competing plant P2 (the three last measures being reported as the maximum over agents). Agriculture (AGR) e…

Original abstract

The origin of agriculture represents a major evolutionary transition and a paradigmatic example of how complex collective behaviors emerge from simple interactions. Here we introduce an artificial society of reinforcement learning agents embedded in a dynamic ecological environment to identify general principles underlying this transition. Within this system, agricultural practices emerge spontaneously - without explicit instruction - through the coupled dynamics of learning and environmental modification. We show that this transition is governed by four key ingredients: individual planning through the valuation of delayed rewards, social vulnerability to cheaters, stabilization via social learning, and an emergent lock-in effect that renders agriculture effectively irreversible once established. In particular, we demonstrate that social learning acts as a "firewall" that suppresses cheater invasion and enables the propagation of successful strategies, leading to sustained population growth and nonlinear amplification of domesticated resources. Together, these results reveal universal mechanisms linking individual decision-making, social interactions, and ecological feedbacks. More broadly, they highlight the potential of artificial societies as experimental platforms to study the emergence of cultural innovations and major evolutionary transitions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces an artificial society of reinforcement learning agents embedded in a dynamic ecological environment. It claims that agricultural practices emerge spontaneously without explicit instruction through the coupled dynamics of learning and environmental modification. The transition is governed by four key ingredients: individual planning through the valuation of delayed rewards, social vulnerability to cheaters, stabilization via social learning (acting as a firewall suppressing cheater invasion), and an emergent lock-in effect rendering agriculture irreversible. These lead to sustained population growth and nonlinear amplification of domesticated resources, revealing universal mechanisms linking individual decision-making, social interactions, and ecological feedbacks.

Significance. If the central results hold under robustness checks, the work provides a novel computational platform for investigating major evolutionary transitions such as the origins of agriculture. By modeling adaptive agents that modify their environment and learn socially, it connects individual valuation of delayed rewards to collective outcomes, offering potential general principles that could inform both multi-agent AI systems and studies of cultural evolution.

major comments (3)

[Abstract] The abstract states that the transition is governed by the four listed ingredients, but the manuscript provides no quantitative details, error bars, ablation studies, or statistical tests to establish that these factors (rather than post-hoc parameter choices) drive the observed behavior. Without such controls, it is impossible to verify necessity or rule out simulation artifacts.
[Model and Simulation Setup] The central claim requires robustness to the concrete implementation choices, including the precise value function/policy update, the functional form of resource growth and environmental modification, the definition of cheater payoffs, and the two free parameters (discount factor for future rewards and social learning rate). If the transition disappears when the discount factor is altered or social learning is replaced by individual trial-and-error, the results are tied to the chosen model rather than the claimed general principles.
[Results] The firewall role of social learning and the lock-in effect are load-bearing for the stability and irreversibility claims, yet the manuscript does not report explicit comparisons (e.g., runs without social learning to demonstrate cheater invasion, or post-establishment perturbations to test irreversibility). These omissions leave the narrative vulnerable to the possibility that the ingredients were selected after observing the runs.

minor comments (2)

[Figures] The figures should include variability measures (e.g., standard deviation across multiple independent runs) to allow assessment of reproducibility.
[Introduction] Additional citations to prior agent-based or evolutionary game-theoretic models of agriculture and major transitions would better situate the contribution.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which identifies key areas for strengthening the manuscript's claims. We address each major comment below and commit to revisions that add the requested quantitative controls, robustness checks, and explicit comparisons without altering the core findings or overclaiming generality.

Referee: [Abstract] The abstract states that the transition is governed by the four listed ingredients, but the manuscript provides no quantitative details, error bars, ablation studies, or statistical tests to establish that these factors (rather than post-hoc parameter choices) drive the observed behavior. Without such controls, it is impossible to verify necessity or rule out simulation artifacts.

Authors: We agree that the abstract, as a high-level summary, does not contain quantitative details or controls. The main text presents simulation outcomes consistent with the four ingredients, but we acknowledge the absence of explicit ablations and statistical reporting. In the revised version we will add a dedicated methods/results subsection with ablation experiments (removing each ingredient in turn), means and standard errors across 50+ independent runs, and basic statistical comparisons to support necessity claims. revision: yes
Referee: [Model and Simulation Setup] The central claim requires robustness to the concrete implementation choices, including the precise value function/policy update, the functional form of resource growth and environmental modification, the definition of cheater payoffs, and the two free parameters (discount factor for future rewards and social learning rate). If the transition disappears when the discount factor is altered or social learning is replaced by individual trial-and-error, the results are tied to the chosen model rather than the claimed general principles.

Authors: We concur that robustness to implementation details is essential. The submitted manuscript reports results for a baseline set of parameters and includes limited sensitivity checks on the discount factor and social learning rate. To address the concern directly, the revision will include expanded supplementary analyses varying the value-function update rule, resource-growth functional forms, cheater-payoff definitions, and a head-to-head comparison of social learning versus pure individual trial-and-error learning, with the transition outcome quantified in each case. revision: yes
Referee: [Results] The firewall role of social learning and the lock-in effect are load-bearing for the stability and irreversibility claims, yet the manuscript does not report explicit comparisons (e.g., runs without social learning to demonstrate cheater invasion, or post-establishment perturbations to test irreversibility). These omissions leave the narrative vulnerable to the possibility that the ingredients were selected after observing the runs.

Authors: We accept that the absence of these targeted controls weakens the presentation of the firewall and lock-in mechanisms. Although the main results illustrate the joint outcome when all ingredients are present, the initial submission did not contain the suggested ablation runs. The revised manuscript will add two new figures and accompanying text: (i) parallel simulations with social learning disabled, showing cheater invasion and collapse of agriculture, and (ii) post-establishment perturbations (introduction of cheaters or resource shocks) demonstrating that agriculture remains stable, all reported with multiple random seeds and error bars. revision: yes
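
The statistical reporting promised in the rebuttal (means and standard errors across independent seeds) can be sketched as follows. This is a minimal illustration, not the authors' analysis code. `dummy_run` is a hypothetical stand-in that returns random noise and would need to be replaced by an actual simulation run returning an outcome metric such as final P1 abundance.

```python
import math
import random
import statistics


def mean_and_sem(values):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def summarize(run_fn, seeds):
    """Run run_fn once per seed and summarize the outcomes."""
    outcomes = [run_fn(seed) for seed in seeds]
    return mean_and_sem(outcomes)


def dummy_run(seed):
    """HYPOTHETICAL placeholder for one simulation run.

    Returns seeded noise in [0, 1); these values are not results of any model.
    """
    return random.Random(seed).random()


if __name__ == "__main__":
    mean, sem = summarize(dummy_run, range(50))
    print(f"outcome over 50 seeds: {mean:.3f} ± {sem:.3f}")
```

An ablation comparison would call `summarize` once per condition (for example, with and without social learning) on the same seed list, so that differences between conditions are not confounded with differences in random initialization.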

Circularity Check

0 steps flagged

No significant circularity; simulation results are self-contained experimental outcomes

The paper reports outcomes from an agent-based simulation of RL agents interacting with a dynamic environment. The central claim—that agriculture emerges spontaneously and is governed by four identified ingredients—is presented as an observation from the runs rather than a closed mathematical derivation. No equations are shown that define a quantity in terms of itself, no parameters are fitted to a data subset and then relabeled as predictions of related quantities, and no load-bearing self-citations or uniqueness theorems imported from prior author work are invoked to force the result. The four ingredients are interpretive labels applied to observed dynamics; while they may have been highlighted post-hoc, this does not constitute definitional equivalence or statistical forcing within the paper's own chain. The simulation itself serves as the falsifiable testbed, making the reported emergence independent of the inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The model relies on standard RL assumptions plus several modeling choices whose values are not reported in the abstract.

free parameters (2)

discount factor for future rewards: Controls the degree of individual planning for delayed agricultural payoffs; value not stated in abstract.
social learning rate: Determines how quickly agents copy successful farming strategies; value not stated in abstract.

axioms (2)

domain assumption: Agents interact in a dynamic ecological environment where actions modify resource availability. Core setup assumption required for environmental feedback loop.
domain assumption: Social learning allows agents to adopt strategies observed from others. Required for the firewall effect against cheaters.
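
The second axiom can be illustrated with a minimal sketch of one possible social learning rule, in which each agent copies the best-performing agent's policy with some probability. This rule, the `copy_rate` parameter, and the representation of a policy as a plain list of numbers are all assumptions made for illustration; the paper's actual mechanism (its Figure 4 refers to "social cloning") may differ.

```python
import random


def social_learning_step(policies, returns, copy_rate, rng):
    """Each agent copies the top-earning agent's policy with probability copy_rate.

    HYPOTHETICAL rule for illustration only.
    policies: list of parameter lists, one per agent
    returns:  list of recent episode returns, one per agent
    """
    best = max(range(len(policies)), key=lambda i: returns[i])
    new_policies = []
    for i, policy in enumerate(policies):
        if i != best and rng.random() < copy_rate:
            new_policies.append(list(policies[best]))  # copy values, do not alias
        else:
            new_policies.append(list(policy))
    return new_policies


if __name__ == "__main__":
    rng = random.Random(0)
    policies = [[0.1, 0.9], [0.7, 0.3], [0.5, 0.5]]  # made-up parameters
    returns = [2.0, 5.0, 1.0]                        # made-up returns
    print(social_learning_step(policies, returns, copy_rate=1.0, rng=rng))
```

With `copy_rate=1.0` every agent adopts the policy of the agent with the highest return, and with `copy_rate=0.0` nothing changes. Under a rule of this kind a low-earning strategy is overwritten rather than propagated, which is one way to read the "firewall" description, though whether cheaters are in fact the low earners depends on the payoff structure of the actual model.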
