Can Large Language Models Implement Agent-Based Models? An ODD-based Replication Study

Carlos M. Fernandes; Daniel Fernandes; João P. Matos-Carvalho; Nuno Fachada

arxiv: 2602.10140 · v2 · submitted 2026-02-08 · 💻 cs.SE · cs.AI · cs.MA

Pith reviewed 2026-05-16 05:49 UTC · model grok-4.3

keywords: large language models · agent-based models · ODD protocol · model replication · predator-prey model · Python code generation · NetLogo baseline

The pith

LLMs can generate behaviorally faithful agent-based models from ODD specifications, though this is not guaranteed and executability alone is insufficient.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests 17 large language models on translating a detailed ODD specification of the PPHPC predator-prey model into Python code. It evaluates the generated models against a trusted NetLogo implementation using statistical comparisons for behavioral match, plus checks for efficiency and maintainability. Some models produce code that runs and matches the original dynamics statistically, but many do not, even if executable. This shows LLMs have potential as tools for model replication but require careful validation beyond just running the code.

Core claim

Behaviorally faithful implementations of agent-based models from ODD descriptions are achievable with certain LLMs like GPT-4.1, which consistently yields statistically valid and efficient Python code matching a NetLogo baseline, but this success is not guaranteed across models and executability by itself does not suffice for scientific applications in replication and validation.

What carries the argument

The ODD protocol specification of the PPHPC predator-prey model, combined with staged executability checks and model-independent statistical comparisons to a validated NetLogo reference implementation.
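A staged executability check can be sketched roughly as follows. The page does not describe the paper's actual stages, so the three stages here (parse, run, well-formed output) and the helper name `staged_check` are assumptions made for illustration, as is the convention that a generated model prints comma-separated numeric rows.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def staged_check(source: str, timeout_s: float = 10.0) -> str:
    """Return the first stage that fails, or 'ok' (stages are hypothetical)."""
    # Stage 1 (assumed): the generated source must at least parse.
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError:
        return "syntax"

    # Stage 2 (assumed): the script must run to completion in a fresh
    # interpreter, without crashing or exceeding the time limit.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "model.py"
        script.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    if proc.returncode != 0:
        return "runtime"

    # Stage 3 (assumed): stdout must be non-empty rows of numeric values.
    rows = [line.split(",") for line in proc.stdout.strip().splitlines()]
    if not rows:
        return "output"
    try:
        for row in rows:
            for field in row:
                float(field)
    except ValueError:
        return "output"
    return "ok"


if __name__ == "__main__":
    for label, src in [
        ("broken syntax", "def f(:"),
        ("crashes", "raise ValueError('boom')"),
        ("numeric rows", "print('400,20,150')"),
    ]:
        print(label, "->", staged_check(src))
```

Passing all three stages only shows that the code runs and emits plausible-looking output; it says nothing about whether the dynamics match the reference, which is why the statistical comparison is a separate step.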

If this is right

Behaviorally faithful implementations are achievable but not guaranteed.
Executability alone is insufficient for scientific use of the generated models.
GPT-4.1 consistently produces statistically valid and efficient implementations.
Claude 3.7 Sonnet performs well but less reliably than GPT-4.1.
These findings have implications for reproducible agent-based and ecological modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

LLMs could speed up initial prototyping of agent-based models, but human oversight remains essential for validation.
Similar ODD-to-code tasks might work for other ecological or social models beyond predator-prey.
Future work could test if prompting strategies improve reliability across more models.

Load-bearing premise

The selected statistical metrics provide enough evidence to confirm full behavioral equivalence between the generated models and the NetLogo baseline.
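A toy illustration of why this premise matters: two time series can agree exactly on mean and variance while having very different temporal structure. The series below are synthetic and hypothetical (they are not PPHPC output); the second is simply a shuffled copy of the first, so any order-independent summary cannot tell them apart.

```python
import math
import random
import statistics


def lag1_autocorr(xs):
    """Lag-1 autocorrelation: how strongly each value predicts the next."""
    m = statistics.fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


# Hypothetical "reference" series: a population oscillating around 100.
reference = [100 + 30 * math.sin(2 * math.pi * t / 50) for t in range(200)]

# Hypothetical "candidate": the same values in scrambled order, so the
# oscillation is destroyed but mean and variance are unchanged.
candidate = reference[:]
random.Random(0).shuffle(candidate)

same_mean = math.isclose(statistics.fmean(reference), statistics.fmean(candidate))
same_var = math.isclose(
    statistics.pvariance(reference), statistics.pvariance(candidate)
)

print("means match:", same_mean)
print("variances match:", same_var)
print("lag-1 autocorrelation (reference):", round(lag1_autocorr(reference), 3))
print("lag-1 autocorrelation (candidate):", round(lag1_autocorr(candidate), 3))
```

The same reasoning applies to spatial clustering or extinction timing: a summary that ignores ordering or location cannot detect a divergence in either.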

What would settle it

Observing significant differences in population statistics or dynamics when running the generated Python model with varied random seeds or initial conditions compared to the reference.

Original abstract

Large language models (LLMs) can now synthesize non-trivial executable code from textual descriptions, raising an important question: can LLMs reliably implement agent-based models from standardized specifications in a way that supports replication, verification, and validation? We address this question by evaluating 17 contemporary LLMs on a controlled ODD-to-code translation task, using the PPHPC predator-prey model as a fully specified reference. Generated Python implementations are assessed through staged executability checks, model-independent statistical comparison against a validated NetLogo baseline, and quantitative measures of runtime efficiency and maintainability. Results show that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produces statistically valid and efficient implementations, with Claude 3.7 Sonnet performing well but less reliably. Overall, the findings clarify both the promise and current limitations of LLMs as model engineering tools, with implications for reproducible agent-based and ecological modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs can turn ODD specs into runnable Python ABM code with uneven success, and the statistical checks against NetLogo give a useful but incomplete picture of behavioral match.

The paper evaluates 17 LLMs on generating Python implementations of the PPHPC predator-prey model from its ODD description, then checks whether the code runs and produces population statistics close to a validated NetLogo version. GPT-4.1 comes out ahead on both statistical agreement and runtime efficiency, while others are less consistent. Claude 3.7 Sonnet is the next best but still variable.

The staged evaluation (first executability, then stats, then efficiency and maintainability metrics) is a clear step forward from anecdotal reports of LLM coding help. Using a fully specified reference model with an external baseline is the right way to ground the test, and the conclusion that simply getting code to run is not enough for scientific use is worth stating plainly. The work is new in its scale and controls; prior papers have not run this many models through the same ODD-to-code pipeline with quantitative comparison.

The soft spot is the validation itself. Aggregate means, variances, and basic hypothesis tests on time series can align even when spatial clustering, extinction timing, or sensitivity to initial conditions diverge. If the paper does not add pattern-oriented checks or agent-level trajectory comparisons, the claim of behavioral faithfulness rests on a narrower base than the abstract suggests. That does not invalidate the results, but it does mean the positive findings for GPT-4.1 are provisional until deeper equivalence tests are shown.

This is the kind of paper that belongs in a methods or tools section of an ecological modeling journal. Researchers who build or replicate ABMs will find the concrete numbers useful even if they disagree with how far the statistics can be pushed. It is coherent on its own terms and engages the right literature, so it deserves a full referee process rather than a desk rejection.

Referee Report

2 major / 3 minor

Summary. The manuscript evaluates 17 LLMs on an ODD-to-Python code translation task for the PPHPC predator-prey ABM, using staged executability checks, model-independent statistical comparisons to a validated NetLogo baseline, and quantitative metrics for runtime efficiency and maintainability. It concludes that behaviorally faithful implementations are achievable but not guaranteed, that executability alone is insufficient for scientific use, and that GPT-4.1 produces statistically valid and efficient implementations most consistently.

Significance. If the statistical comparisons are robust, the work supplies a controlled, replicable benchmark for LLM-assisted ABM replication that is directly relevant to reproducible modeling in ecology and complex systems. The use of a fully specified ODD reference and an external validated baseline, together with explicit efficiency and maintainability measures, strengthens the empirical contribution over purely anecdotal LLM code-generation studies.

major comments (2)

[§4.2] §4.2 (Statistical Validation): The claim of 'behaviorally faithful' implementations rests on aggregate population statistics (means, variances, and simple hypothesis tests on time series). For the stochastic PPHPC model these metrics can match while spatial clustering, extinction-risk distributions, or sensitivity to stochastic events diverge; the manuscript does not report pattern-oriented validation or agent-level trajectory matching, so the central 'faithful' label and the assertion that executability is insufficient both depend on an untested assumption that the chosen proxies suffice.
[§3.3] §3.3 (Evaluation Protocol): Sample sizes for the statistical tests, exact test procedures (e.g., t-test vs. Kolmogorov-Smirnov), number of independent stochastic realizations per LLM, and exclusion criteria for non-convergent runs are not fully specified. These details are load-bearing for interpreting the performance ranking of GPT-4.1 versus Claude 3.7 Sonnet and for the reproducibility of the 'statistically valid' conclusion.

minor comments (3)

[Abstract] Abstract and §2: 'GPT-4.1' should be clarified (exact model identifier and release date) to avoid ambiguity with standard GPT-4 variants.
[§3.1] Table 1 or §3.1: Provide the complete list of the 17 LLMs with versions and providers so readers can replicate the exact prompt-engineering setup.
[§5] §5 (Discussion): The generalizability claim beyond PPHPC would be strengthened by a brief note on how the ODD protocol and metrics would transfer to other ABMs with different spatial or network structures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us clarify the scope and limitations of our evaluation. We address each major point below and describe the revisions we will make to the manuscript.

Referee: [§4.2] §4.2 (Statistical Validation): The claim of 'behaviorally faithful' implementations rests on aggregate population statistics (means, variances, and simple hypothesis tests on time series). For the stochastic PPHPC model these metrics can match while spatial clustering, extinction-risk distributions, or sensitivity to stochastic events diverge; the manuscript does not report pattern-oriented validation or agent-level trajectory matching, so the central 'faithful' label and the assertion that executability is insufficient both depend on an untested assumption that the chosen proxies suffice.

Authors: We agree that aggregate population statistics represent a necessary but not sufficient condition for claiming behavioral fidelity in a stochastic spatial model such as PPHPC. Our study deliberately focused on model-independent metrics that can be computed from any implementation's output without requiring modifications to the code or access to internal agent states. This choice was made to enable a fair comparison across all 17 LLMs. We recognize that this approach does not capture potential discrepancies in spatial patterns or extinction dynamics. In the revised version, we will add a dedicated paragraph in §4.2 discussing the limitations of the chosen validation proxies and recommending pattern-oriented modeling (POM) techniques for future LLM-assisted ABM replications. We believe this strengthens rather than undermines our core conclusion that executability alone is insufficient, since even the basic statistics already reveal failures in several models.

revision: partial

Referee: [§3.3] §3.3 (Evaluation Protocol): Sample sizes for the statistical tests, exact test procedures (e.g., t-test vs. Kolmogorov-Smirnov), number of independent stochastic realizations per LLM, and exclusion criteria for non-convergent runs are not fully specified. These details are load-bearing for interpreting the performance ranking of GPT-4.1 versus Claude 3.7 Sonnet and for the reproducibility of the 'statistically valid' conclusion.

Authors: We thank the referee for pointing out these omissions. The protocol in §3.3 used 30 independent runs per LLM to account for stochasticity, with each run simulating 1000 time steps. Statistical comparisons employed two-sample t-tests for mean population sizes, Levene's test for variances, and Kolmogorov-Smirnov tests for the full distribution of time-series values. Runs were excluded if they failed to execute or produced invalid outputs (e.g., division by zero or non-numeric results), which occurred in fewer than 8% of cases. We will revise §3.3 to explicitly state these parameters, including the rationale for the sample size and the exact statistical procedures used, thereby improving the reproducibility of our findings.

revision: yes
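To make the kind of comparison named in this response concrete, here is a minimal standard-library sketch of two of the tests: a Welch-style t statistic and a two-sample Kolmogorov-Smirnov statistic with its asymptotic 5% critical value. It takes the 30-run figure from the simulated response at face value. The per-run values are synthetic draws, not results from the paper, and the function names are hypothetical. Levene's test is omitted, and no p-values are computed.

```python
import bisect
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for a difference in means (unequal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect.bisect_right(sa, x) / len(sa)
        fb = bisect.bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


def ks_critical(n, m, c_alpha=1.358):
    """Asymptotic two-sample KS critical value; c_alpha=1.358 is the 5% level."""
    return c_alpha * math.sqrt((n + m) / (n * m))


# Synthetic per-run summary values (e.g. mean prey population per run).
# All numbers below are made up for illustration.
rng = random.Random(42)
n_runs = 30
reference = [rng.gauss(400, 10) for _ in range(n_runs)]
biased = [rng.gauss(430, 10) for _ in range(n_runs)]  # shifted dynamics

t = welch_t(reference, biased)
d = ks_statistic(reference, biased)
crit = ks_critical(n_runs, n_runs)

print("Welch t:", round(t, 2))
print("KS D:", round(d, 3), "critical (5%):", round(crit, 3))
print("distributions differ at 5%:", d > crit)
```

In practice one would more likely use a statistics library that reports p-values directly; the hand-rolled versions here only serve to show what each test measures.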

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation against external baseline

The paper performs an empirical replication study: LLMs translate a fixed ODD specification of the PPHPC model into Python code, which is then executed and compared to an independent, pre-validated NetLogo baseline using standard statistical metrics (means, variances, hypothesis tests) plus separate efficiency and maintainability measures. No parameters are fitted to the target results, no self-definitional equations are used, and no load-bearing claims reduce to self-citations or prior author work by construction. The central finding—that faithful implementations are achievable but not guaranteed—rests on direct, falsifiable comparison to the external reference implementation rather than any internal derivation that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on the assumption that the PPHPC NetLogo implementation is a correct reference and that the chosen statistical metrics adequately capture behavioral fidelity. No free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The PPHPC NetLogo implementation serves as a valid and complete behavioral reference for the ODD specification. Used as the baseline for all statistical comparisons in the evaluation.

pith-pipeline@v0.9.0 · 5484 in / 1166 out tokens · 31264 ms · 2026-05-16T05:49:46.018201+00:00 · methodology