pith. sign in

arxiv: 2508.17599 · v2 · pith:47MPOQLVnew · submitted 2025-08-25 · 🧬 q-bio.PE · cond-mat.dis-nn· nlin.AO

Decoding species coexistence: A reinforcement learning perspective

Pith reviewed 2026-05-21 23:30 UTC · model grok-4.3

classification 🧬 q-bio.PE cond-mat.dis-nnnlin.AO
keywords species coexistencerock-paper-scissorsreinforcement learningQ-learningspatial ecologybiodiversity maintenanceadaptive mobility
0
0 comments X

The pith

In a spatial rock-paper-scissors model, mobility adaptively regulated by reinforcement learning allows all three species to coexist stably across broad migration rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies a reinforcement learning framework to a spatial rock-paper-scissors model to resolve why high mobility does not always eliminate biodiversity as earlier fixed-mobility theories predicted. Individuals use Q-learning to adjust movement in response to local encounters rather than following preset rates. This produces stable coexistence with low extinction risk over wide baseline migration values. The mechanism hinges on learned behaviors that balance predator avoidance with prey pursuit.

Core claim

When mobility is adaptively regulated via a Q-learning algorithm in a spatial RPS model, all three species coexist stably with low extinction probabilities across a broad range of baseline migration rates. Individuals develop survival priority by escaping predators and predation priority by remaining near prey. Coexistence arises from the balance of these tendencies; imbalance jeopardizes biodiversity. A symmetry-breaking of action preference in a particular state accounts for divergent species densities. Q-learning species show a significant evolutionary advantage when interacting with fixed-mobility counterparts.

What carries the argument

Q-learning algorithm that adaptively regulates individual mobility based on local predator-prey encounters.

If this is right

  • Coexistence remains stable over wide migration rates because learned behaviors balance escape and pursuit.
  • Imbalance between survival priority and predation priority drives biodiversity loss.
  • Symmetry-breaking in action preference in one state produces unequal species densities.
  • Adaptive-mobility species outcompete fixed-mobility species in direct interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Natural populations may evolve learning rules for movement that promote long-term diversity in competitive settings.
  • The approach could be tested in other cyclic competition models to check whether adaptive regulation generally supports coexistence.
  • Conservation planning might prioritize habitat features that enable behavioral adjustment over simply restricting overall movement.

Load-bearing premise

The Q-learning process accurately represents how real organisms adjust mobility in response to local predator-prey encounters.

What would settle it

Direct observation or measurement of movement rates in natural three-species cyclic systems showing whether individuals increase movement away from predators and toward prey in proportions matching the model's learned action preferences at high mobility.

Figures

Figures reproduced from arXiv: 2508.17599 by Chenyang Zhao, Jiqiang Zhang, Kaiwen Jiang, Li Chen, Shengfeng Deng, Weiran Cai.

Figure 1
Figure 1. Figure 1: , which occur at a rate σ. Reaction 2 shows the reproduction process with a rate µ, which can only take place when an adjacent site is empty. Reaction 3 rep￾resents the migration process with an exchange rate ε0. Following these reactions, one can monitor the density evolution of the three species and study the impact of these parameters, e.g., with the Gillespie algorithm [26]. Typically, the predation an… view at source ↗
Figure 2
Figure 2. Figure 2: (b, c) provide the typical time series for the two scenarios at M0 = 3 × 10−4 of all densities ρi , where i ∈ {A, B, C, ∅} [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: FIG. 3 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: FIG. 4 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: FIG. 5 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: FIG. 6 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: FIG. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: FIG. 7 [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: FIG. 9 [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: FIG. 10 [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
read the original abstract

A central goal in ecology is to understand how biodiversity is maintained. Previous theoretical works have employed the rock-paper-scissors (RPS) game as a toy model, demonstrating that population mobility is crucial in determining the species' coexistence. One key prediction is that biodiversity is jeopardized and eventually lost when mobility exceeds a certain value--a conclusion at odds with empirical observations of highly mobile species coexisting in nature. To address this discrepancy, we introduce a reinforcement learning framework and study a spatial RPS model, where individual mobility is adaptively regulated via a Q-learning algorithm rather than held fixed. Our results show that all three species can coexist stably, with extinction probabilities remaining low across a broad range of baseline migration rates. Mechanistic analysis reveals that individuals develop two behavioral tendencies: survival priority (escaping from predators) and predation priority (remaining near prey). While species coexistence emerges from the balance of the two tendencies, their imbalance jeopardizes biodiversity. Notably, there is a symmetry-breaking of action preference in a particular state that is responsible for the divergent species densities. Furthermore, when Q-learning species interact with fixed-mobility counterparts, those with adaptive mobility exhibit a significant evolutionary advantage. Our study suggests that reinforcement learning may offer a promising new perspective for uncovering the mechanisms of biodiversity and informing conservation strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a reinforcement learning (Q-learning) framework to model adaptive individual mobility in a spatial rock-paper-scissors (RPS) ecological model. Unlike fixed-mobility versions where high mobility leads to biodiversity loss, the adaptive model allows stable coexistence of all three species, with low extinction probabilities maintained across a broad range of baseline migration rates. Mechanistic analysis identifies survival-priority and predation-priority behaviors, symmetry-breaking in action preferences that drives density differences, and an evolutionary advantage for Q-learning agents when competing against fixed-mobility counterparts.

Significance. If the reported simulation outcomes are robust, the work provides a novel mechanistic perspective on biodiversity maintenance that reconciles theoretical mobility thresholds with empirical observations of highly mobile coexisting species. The integration of reinforcement learning to derive adaptive behavioral rules from local interactions is a clear strength and could inform future agent-based ecological models and conservation applications.

major comments (2)
  1. [Results / Simulation protocol] The central claim that extinction probabilities remain low for all three species across a broad range of baseline migration rates rests on the simulation protocol correctly sampling the tail of the extinction-time distribution. The abstract and available description supply no information on the number of independent Monte Carlo replicates, total run length, burn-in period, or convergence diagnostics for the probability estimates; if these are modest or short relative to typical extinction timescales at high mobility, rare extinctions could be missed, undermining the stability conclusion (see skeptic note on underpowered estimates).
  2. [Methods / Q-learning details] The Q-learning implementation introduces free parameters (learning rate and exploration parameter) whose specific values and sensitivity are not reported. Because the coexistence result is obtained from agent-based simulations rather than an algebraic reduction, it is necessary to demonstrate that the low-extinction outcome is not an artifact of particular hyperparameter choices or post-hoc tuning.
minor comments (2)
  1. Clarify the precise mapping between the 'baseline migration rate' parameter and the adaptive mobility output of the Q-learning algorithm; this notation is used in the abstract but its operational definition is unclear from the summary.
  2. Add error bars, confidence intervals, or replicate variability to any figures reporting extinction probabilities or species densities to allow visual assessment of robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments below and have revised the manuscript to incorporate additional details on simulation protocols and Q-learning hyperparameters, along with supporting analyses to strengthen the robustness claims.

read point-by-point responses
  1. Referee: [Results / Simulation protocol] The central claim that extinction probabilities remain low for all three species across a broad range of baseline migration rates rests on the simulation protocol correctly sampling the tail of the extinction-time distribution. The abstract and available description supply no information on the number of independent Monte Carlo replicates, total run length, burn-in period, or convergence diagnostics for the probability estimates; if these are modest or short relative to typical extinction timescales at high mobility, rare extinctions could be missed, undermining the stability conclusion (see skeptic note on underpowered estimates).

    Authors: We agree that explicit reporting of the simulation protocol is necessary for assessing the reliability of the extinction probability estimates. In the revised manuscript, we will add a new subsection in the Methods section specifying the number of independent Monte Carlo replicates (1000 per parameter combination), total simulation length (10^6 time steps after a 10^5-step burn-in), and convergence checks (monitoring stabilization of species densities and extinction event counts). These parameters were chosen to exceed typical extinction timescales observed in fixed-mobility controls, ensuring adequate sampling of rare events. We will also include supplementary figures showing cumulative extinction probability convergence over replicate count. revision: yes

  2. Referee: [Methods / Q-learning details] The Q-learning implementation introduces free parameters (learning rate and exploration parameter) whose specific values and sensitivity are not reported. Because the coexistence result is obtained from agent-based simulations rather than an algebraic reduction, it is necessary to demonstrate that the low-extinction outcome is not an artifact of particular hyperparameter choices or post-hoc tuning.

    Authors: We acknowledge the importance of demonstrating robustness to hyperparameter choices. The revised manuscript will explicitly report the values used (learning rate α = 0.1, exploration rate ε = 0.05 with linear decay) in the Methods. We will add a new supplementary section with sensitivity analyses varying α from 0.01 to 0.5 and ε from 0.01 to 0.2, showing that stable coexistence with low extinction probabilities persists across this range. These results confirm that the reported outcomes are not sensitive to specific tuning. revision: yes

Circularity Check

0 steps flagged

No circularity: coexistence claims arise from independent agent-based RL simulations

full rationale

The paper's central results on stable coexistence and low extinction probabilities are generated by running Q-learning agents in a spatial RPS lattice model. No algebraic derivation chain is presented that reduces the reported outcomes to fitted parameters, self-definitions, or prior self-citations. The adaptive mobility behaviors and symmetry-breaking are observed outputs of the simulation protocol rather than inputs renamed as predictions. The work is self-contained against external benchmarks because the RL update rules and mobility adaptation are defined independently of the final extinction statistics.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The model rests on standard RPS interaction rules and typical Q-learning update mechanics; no new physical entities are postulated, but several algorithmic parameters remain unspecified in the abstract.

free parameters (2)
  • Q-learning rate
    Standard hyperparameter controlling how quickly action values are updated; value and selection method not stated in abstract.
  • exploration parameter
    Controls balance between trying new actions and using learned ones; tuning details absent from abstract.
axioms (1)
  • domain assumption Species interactions follow cyclic dominance on a spatial lattice with local movement decisions.
    Invoked as the base model structure throughout the abstract.

pith-pipeline@v0.9.0 · 5774 in / 1291 out tokens · 52756 ms · 2026-05-21T23:30:08.198702+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A brief review of evolutionary game dynamics in the reinforcement learning paradigm

    q-bio.PE 2026-02 unverdicted novelty 2.0

    A review synthesizing how reinforcement learning in evolutionary games provides a unified framework for social and ecological phenomena beyond traditional imitation models.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 1 Pith paper

  1. [1]

    Initialize all the items of the three Q-tables with random num- bers Qs,a ∈ (0, 1) independently to mimic the un- awareness of individuals to the surroundings

    Each site of the lattice is randomly occupied by an individual of A, B, C, or left empty. Initialize all the items of the three Q-tables with random num- bers Qs,a ∈ (0, 1) independently to mimic the un- awareness of individuals to the surroundings. Each player i takes a random migration rate withai ∈ A

  2. [2]

    Their states also need to be updated

    In the learning process, each agent’s action is made by pure exploration ai ∈ A ; afterwards, their re- wards are obtained by collecting payoffs, and then they update their Q-tables to accumulate their ex- perience. Their states also need to be updated

  3. [3]

    Their migration is then strictly guided by the corresponding Q-table belonging to their species, and the three Q-tables are no longer revised

    After the three Q-tables are converged, the game process starts. Their migration is then strictly guided by the corresponding Q-table belonging to their species, and the three Q-tables are no longer revised. Repeat step 2 till the convergence of three Q-tables, which completes the learning process. Repeat step 3 un- til the system reaches a statistically ...

  4. [4]

    M. E. Assessment, Ecosystems and human well-being: current state and trends: findings of the Condition and Trends Working Group (Island press, 2005)

  5. [5]

    Darwin, On the origin of species, 1859 (Routledge London, UK:, 2004)

    C. Darwin, On the origin of species, 1859 (Routledge London, UK:, 2004)

  6. [6]

    Pennisi, Science 309, 90 (2005)

    E. Pennisi, Science 309, 90 (2005)

  7. [7]

    Government, The Economics of Biodiversity: The Dasgupta Review (UK Government, 2021)

    U. Government, The Economics of Biodiversity: The Dasgupta Review (UK Government, 2021)

  8. [8]

    May and A

    R. May and A. R. McLean, Theoretical ecology: princi- ples and applications (OUP Oxford, 2007)

  9. [9]

    J. D. Murray, Mathematical biology: I. An introduction 3rd ed. , Vol. 17 (Springer Science & Business Media, 2013)

  10. [10]

    C. L. Lehman and D. Tilman, Spatial ecology: the role of space in population dynamics and interspecific inter- actions 185, 191 (1997)

  11. [11]

    A. J. McLane, C. Semeniuk, G. J. McDermid, and D. J. Marceau, Ecological modelling 222, 1544 (2011)

  12. [12]

    J. M. Smith, Evolution and the Theory of Games (Cam- bridge Univ. Press, 1982)

  13. [13]

    Hofbauer and K

    J. Hofbauer and K. Sigmund, Evolutionary Games and Population Dynamics (Cambridge University Press, Cambridge, 1998)

  14. [14]

    M. A. Nowak, Evolutionary Dynamics (Belknap Press, Cambridge, MA, 2006)

  15. [15]

    Durrett and S

    R. Durrett and S. Levin, Journal of Theoretical Biology 185, 165 (1997)

  16. [16]

    Durrett and S

    R. Durrett and S. Levin, Theoretical Population Biology 53, 30 (1998)

  17. [17]

    B. Kerr, C. Neuhauser, B. J. M. Bohannan, and A. M. Dean, Nature 418, 171 (2002)

  18. [19]

    R. M. May and W. J. Leonard, SIAM Journal on Applied Mathematics 29, 243 (1975)

  19. [20]

    C. R. Johnson and I. Seinen, Proceedings of the Royal Society of London. Series B: Biological Sciences 269, 655 (2002)

  20. [21]

    Reichenbach, M

    T. Reichenbach, M. Mobilia, and E. Frey, Physical Re- view E 74, 051907 (2006)

  21. [22]

    Szab´ o and G

    G. Szab´ o and G. F´ ath, Physics Reports446, 97 (2007)

  22. [23]

    Szolnoki, M

    A. Szolnoki, M. Mobilia, L.-L. Jiang, B. Szczesny, A. M. Rucklidge, and M. Perc, Journal of the Royal Society Interface 11, 20140735 (2014)

  23. [24]

    Zhou, Contemporary Physics 57, 151 (2016)

    H.-J. Zhou, Contemporary Physics 57, 151 (2016)

  24. [25]

    Sinervo and C

    B. Sinervo and C. M. Lively, Nature 380, 240 (1996)

  25. [26]

    C. E. Paquin and J. Adams, Nature 306, 368 (1983)

  26. [27]

    Jackson and L

    J. Jackson and L. Buss, Proceedings of the National Academy of Sciences 72, 5160 (1975)

  27. [28]

    T. L. Cz´ ar´ an, R. F. Hoekstra, and L. Pagie, Proceedings of the National Academy of Sciences 99, 786 (2002)

  28. [29]

    Reichenbach, M

    T. Reichenbach, M. Mobilia, and E. Frey, Nature 448, 1046 (2007)

  29. [30]

    Yang, W.-X

    R. Yang, W.-X. Wang, Y.-C. Lai, and C. Grebogi, Chaos 20, 023113 (2010)

  30. [31]

    Wang, Y.-C

    W.-X. Wang, Y.-C. Lai, and C. Grebogi, Physical Re- view E 81, 046113 (2010)

  31. [32]

    W.-X. Wang, X. Ni, Y.-C. Lai, and C. Grebogi, Physical Review E 83, 011917 (2011)

  32. [33]

    J. Park, Y. Do, Z. Huang, and Y. Lai, Chaos 23, 023128 (2013)

  33. [34]

    Huang, X

    W. Huang, X. Duan, L. Qin, and J. Park, Applied Math- ematics and Computation 456, 128135 (2023), early ac- cess: Jun 2023

  34. [35]

    H.-W. Lee, C. Cleveland, and A. Szolnoki, Chaos 32, 093103 (2022)

  35. [36]

    Menezes, M

    J. Menezes, M. Tenorio, and E. Rangel, EPL 139, 57002 (2022)

  36. [37]

    Park and B

    J. Park and B. Jang, Journal of the Korean Society for Industrial and Applied Mathematics 24, 351 (2020)

  37. [38]

    Park, EPL 126, 38004 (2019)

    J. Park, EPL 126, 38004 (2019). 11 Algorithm 1: RPS model with Q-learning Input: α, γ Initialization; Q1, Q2, Q3 ← random(15 × 7); Lattice point ← random[0, 3]L×L; σ, µ ← 1; Nstep ← 10; Learning Process; repeat for each round t do for Each agent do Agent picks a random action a ∈ A; for interaction count = 1 to Nstep × L2 do Randomly select an agent and i...

  38. [39]

    Z. Ding, G. Zheng, C. Cai, W. Cai, L. Chen, J. Zhang, and X. Wang, Chaos, Solitons & Fractals 175, 114032 (2023)

  39. [40]

    Zheng, J

    G. Zheng, J. Zhang, S. Deng, W. Cai, and L. Chen, Chaos, Solitons & Fractals 188, 115568 (2024)

  40. [41]

    Zheng, J

    G. Zheng, J. Zhang, J. Zhang, W. Cai, and L. Chen, New Journal of Physics 26, 053041 (2024)

  41. [42]

    Zheng, J

    G. Zheng, J. Zhang, X. Ou, S. Deng, and L. Chen, Phys- ical Review E 111, 064307 (2025)

  42. [43]

    Zheng, W

    G. Zheng, W. Cai, G. Qi, J. Zhang, and L. Chen, arXiv:2312.14970 (2023)

  43. [44]

    Zhang, J

    S. Zhang, J. Zhang, L. Chen, and X. Liu, Nonlinear Dynamics 99, 3301 (2020)

  44. [45]

    Zhang, S

    J. Zhang, S. Zhang, L. Chen, and X. Liu, Physical Re- view E 101, 042402 (2020)

  45. [46]

    M. M. Olsen and R. Fraczkowski, Journal of Computa- tional Science 9, 118 (2015). (a) 0 500 1000 1500 2000 2500 3000 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 A B C empty sites (b) FIG. 10: Evolution in predation dominance scenarios. The typical pattern (a) and time series (b) in the predation dominance scenario with Rp = 40 and Rs = 0.5. Parameters: N...

  46. [47]

    X. Wang, J. Cheng, and L. Wang, Entropy 21, 773 (2019)

  47. [48]

    X. Wang, J. Cheng, and L. Wang, Ecological Complexity 42, 100815 (2020)

  48. [49]

    J. Park, J. Lee, T. Kim, I. Ahn, and J. Park, Entropy 23, 461 (2021)

  49. [50]

    J. Li, L. Li, and S. Zhao, New Journal of Physics 25, 092001 (2023)

  50. [51]

    Tsutsui, R

    K. Tsutsui, R. Tanaka, K. Takeda, and K. Fujii, Elife 13, e85694 (2024)

  51. [52]

    Si and T

    Z. Si and T. Ito, Chaos, Solitons & Fractals 199, 116628 (2025)

  52. [53]

    Reichenbach, M

    T. Reichenbach, M. Mobilia, and E. Frey, Journal of Theoretical Biology 254, 368 (2008)

  53. [54]

    C. J. C. H. Watkins and P. Dayan, Machine Learning 8, 279 (1992)

  54. [55]

    Sutton and A

    R. Sutton and A. Barto, Reinforcement Learning:An In- troduction (MIT press, 2018)

  55. [56]

    J. E. R. Staddon, Adaptive behavior and learning (Cam- bridge University Press, 1983)

  56. [57]

    A. J. Underwood, Experiments in ecology: their logi- cal design and interpretation using analysis of variance (Cambridge university press, 1997)