pith. machine review for the scientific record.

arxiv: 2605.12824 · v1 · submitted 2026-05-12 · 💻 cs.MA · cs.AI · cs.CL · cs.CY

Recognition: no theorem link

Mechanism Plausibility in Generative Agent-Based Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.CL · cs.CY
keywords LLM agent-based models · mechanism plausibility · philosophy of science · generative sufficiency · social simulations · explanatory models

The pith

A four-level scale separates whether LLM-based agent models reproduce social phenomena from whether they plausibly explain how those phenomena arise through mechanisms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper integrates work on large language model agent-based models with philosophy of science to define a notion of plausibility. It proposes a four-level scale, the Mechanism Plausibility Scale, that distinguishes a model's ability to generate a phenomenon from its ability to show the mechanisms that could produce it. This matters because the capability to reproduce behaviors is not the same as an explanation of how they occur. Modelers can use the scale to clarify whether their simulations are advancing prediction or explanation.

Core claim

By combining recent LLM-ABM research with contemporary philosophy of science literature on mechanisms, the authors operationalize plausibility as a four-level scale. This scale separates the evaluation of a model's generative sufficiency, meaning its ability to reproduce a phenomenon, from its mechanistic plausibility, meaning how the phenomenon could be produced by related organized entities and activities. It also clarifies the distinct roles of predictive models versus explanatory models in agent-based simulations.

What carries the argument

The Mechanism Plausibility Scale, a four-level operationalization that evaluates generative sufficiency separately from mechanistic plausibility based on philosophy of science concepts of mechanisms.
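To make the separation concrete, here is a minimal sketch, not from the paper: the level names and criteria below are assumptions inferred from the review's figure summaries (Levels 0–3, with Game of Life at Level 0 and the Artificial Anasazi Model at Level 3), encoded as a cumulative rubric.

```python
from enum import IntEnum

class MechanismPlausibility(IntEnum):
    # Level semantics are assumptions inferred from the review's figure
    # summaries, not the paper's own definitions.
    LEVEL_0 = 0  # no empirical target phenomenon claimed (e.g. Game of Life)
    LEVEL_1 = 1  # generative sufficiency only: output matches the phenomenon
    LEVEL_2 = 2  # how-possibly: organized entities/activities could produce it
    LEVEL_3 = 3  # how-actually: mechanism consistent with independent evidence

def rate(reproduces_target: bool,
         shows_mechanism: bool,
         mechanism_validated: bool) -> MechanismPlausibility:
    """Assign the highest level whose cumulative criteria are met."""
    level = MechanismPlausibility.LEVEL_0
    if reproduces_target:
        level = MechanismPlausibility.LEVEL_1
        if shows_mechanism:
            level = MechanismPlausibility.LEVEL_2
            if mechanism_validated:
                level = MechanismPlausibility.LEVEL_3
    return level
```

The rubric makes the core separation visible: a model can reach Level 1 (generative sufficiency) on reproduction alone, while Levels 2 and 3 (mechanistic plausibility) each demand additional mechanistic evidence.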

If this is right

  • Simulations can be classified according to whether they merely reproduce observed behaviors or demonstrate plausible mechanisms for producing them.
  • Predictive models focus on capability to match data, while explanatory models require evidence of mechanistic pathways.
  • Modelers gain a grounded framework to describe experiment characteristics and assess progress toward explanation.
  • Evaluation of LLM-generated behaviors in social simulations becomes more structured by separating reproduction from explanation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers in other generative modeling domains could adapt the scale to assess mechanistic claims.
  • The scale suggests new experimental designs that test for specific mechanisms in agent behaviors.
  • Adoption might encourage more interdisciplinary work between computational modelers and philosophers of science.

Load-bearing premise

That concepts of mechanisms from philosophy of science can be directly turned into a practical four-level scale for assessing LLM agent behaviors without additional validation or adaptation to specific domains.

What would settle it

A study where independent raters apply the four-level scale to the same set of LLM-ABM papers and find that their ratings do not align on the mechanistic plausibility levels, or where models rated high on the scale fail to predict new mechanistic interventions.
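The inter-rater half of that test could be scored with a standard chance-corrected agreement statistic such as Cohen's kappa (the Landis and Koch agreement benchmarks appear in the paper's reference list). A minimal sketch, with fabricated ratings on the 0–3 scale for illustration only:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters give the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical levels assigned to ten LLM-ABM papers by two independent raters.
a = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
b = [0, 1, 2, 2, 2, 1, 3, 1, 0, 3]
print(round(cohens_kappa(a, b), 3))  # → 0.589, "moderate" on the usual benchmarks
```

A study of this shape would bear on the claim directly: systematically low kappa on the mechanistic-plausibility levels would indicate the scale's criteria are not operational enough for independent raters to apply consistently.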

Figures

Figures reproduced from arXiv: 2605.12824 by David Huu Pham, Nicholas Vincent, Patrick Zhao.

Figure 1. An adapted Craver diagram [19] showing a simulation producing T with higher- and lower-level mechanisms. In the ABM, the agents/entities {x1, …, xm} (circles) and activities {φ1, …, φn} (arrows) work to produce T. The agents in the ABM are further and reciprocally constituted by lower-level mechanisms, which are generally abstracted away for the purposes of tractability, but are also why simula…
Figure 2. The Plausibility Scale classifies models based on their epistemic contribution.
Figure 3. The Mechanism Plausibility Scale in checklist form.
Figure 4. Example for Level 0: the Mechanism Plausibility Scale applied to an implementation of Conway's Game of Life.
Figure 5. Example for Level 1: the Mechanism Plausibility Scale applied to a fabricated game theory paper.
Figure 6. Example for Level 2: the Mechanism Plausibility Scale applied to Schelling's Model of Segregation [70].
Figure 7. Example for Level 3: the Mechanism Plausibility Scale applied to the Artificial Anasazi Model [24].
read the original abstract

Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recently, research has aim to test whether they are capable of generating different phenomena of interest, for example, human behavior on social media platforms or performance in game-theoretic scenarios. However, capability, prediction, and explanation are different -- drawing from the philosophy of science and mechanisms literature, \textit{explanation} requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of `plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes integrating recent work on LLM-based agent-based models (ABMs) with philosophy-of-science literature on mechanisms to operationalize a four-level Mechanism Plausibility Scale. The scale is intended to separate evaluation of generative sufficiency (ability to reproduce a target phenomenon) from mechanistic plausibility (how the phenomenon is produced by organized entities and activities), thereby clarifying the distinct roles of predictive versus explanatory models.

Significance. If the scale can be made operational with concrete criteria, it would offer a useful conceptual tool for researchers working on generative ABMs to ground claims about explanation rather than mere reproduction. The integration of mechanism concepts from philosophy of science addresses a recognized gap in evaluating black-box LLM agents, and the distinction between sufficiency and plausibility could help structure future model assessments in social simulation.

major comments (2)
  1. [Abstract / Scale definition] Abstract and the section defining the scale: no explicit criteria, level definitions, or mapping from philosophy-of-science mechanism concepts (organized entities and activities producing a phenomenon) to observable properties of LLM-generated text or agent behaviors are supplied. The central claim that the scale cleanly separates generative sufficiency from mechanistic plausibility therefore rests on unstated interpretive assumptions rather than operational rules.
  2. [Abstract] Abstract: the manuscript states that the scale will be applied to phenomena such as social-media behavior and game-theoretic scenarios, yet provides no worked example, test case, or illustration of how any level would be assigned to an existing LLM-ABM output. Without such demonstrations the proposal remains conceptual and its usability for modelers cannot be assessed.
minor comments (2)
  1. [Abstract] Abstract: 'research has aim to test' should read 'research has aimed to test'.
  2. [Abstract] Abstract: the phrase 'different phenomena of interest' is vague; a single concrete reference to one of the cited domains would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and for recognizing the potential value of integrating mechanism concepts from philosophy of science with LLM-based ABMs. We agree that the current manuscript is primarily conceptual and that explicit operational criteria plus a worked example are necessary to demonstrate usability. We address each major comment below and will incorporate the requested clarifications in a revised version.

read point-by-point responses
  1. Referee: [Abstract / Scale definition] Abstract and the section defining the scale: no explicit criteria, level definitions, or mapping from philosophy-of-science mechanism concepts (organized entities and activities producing a phenomenon) to observable properties of LLM-generated text or agent behaviors are supplied. The central claim that the scale cleanly separates generative sufficiency from mechanistic plausibility therefore rests on unstated interpretive assumptions rather than operational rules.

    Authors: We accept this assessment. The four-level scale is defined by reference to the degree of alignment with mechanistic explanation (organized entities and activities), but the manuscript does not yet supply concrete, observable criteria for assigning levels to LLM outputs. In the revision we will expand the scale-definition section to list explicit criteria for each level, including mappings such as: Level 0 covers models with no real-world target phenomenon; Level 1 requires only output matching the target phenomenon; Level 2 requires evidence of identifiable entities and activities that could plausibly produce it; Level 3 requires consistency with independently validated mechanisms. These criteria will be stated in terms of observable properties of generated text or behavior traces. revision: yes

  2. Referee: [Abstract] Abstract: the manuscript states that the scale will be applied to phenomena such as social-media behavior and game-theoretic scenarios, yet provides no worked example, test case, or illustration of how any level would be assigned to an existing LLM-ABM output. Without such demonstrations the proposal remains conceptual and its usability for modelers cannot be assessed.

    Authors: We agree that a concrete illustration is required to show how the scale functions in practice. Although the manuscript is a conceptual proposal, we will add a new subsection containing a worked example. Using a published LLM-ABM study on social-media posting behavior (or, alternatively, a game-theoretic coordination task), we will walk through the assignment of each level, citing specific output features that justify the rating and showing how the distinction between generative sufficiency and mechanistic plausibility is applied. revision: yes

Circularity Check

0 steps flagged

No circularity: Mechanism Plausibility Scale is an external operationalization, not a self-referential derivation.

full rationale

The paper proposes a four-level scale by integrating external philosophy-of-science literature on mechanisms with existing LLM-ABM work. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the scale definition to its own inputs by construction. The separation of generative sufficiency from mechanistic plausibility is presented as a conceptual framework drawn from cited sources rather than an internal fit or renaming. This is a standard non-circular proposal of a new evaluative tool.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the applicability of mechanism concepts from philosophy of science to LLM-generated agent behaviors and on the utility of a four-level ordinal scale for evaluation.

axioms (1)
  • domain assumption Explanation requires showing how a phenomenon is produced by related organized entities and activities
    Invoked in the abstract when contrasting capability, prediction, and explanation
invented entities (1)
  • Mechanism Plausibility Scale no independent evidence
    purpose: To rate LLM-ABMs on a four-level spectrum from generative sufficiency to mechanistic plausibility
    Newly proposed construct; no independent evidence supplied in the abstract

pith-pipeline@v0.9.0 · 5521 in / 1179 out tokens · 45276 ms · 2026-05-14T19:21:05.050772+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 66 canonical work pages · 3 internal anchors

  1. [1]

    William Agnew, A. Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, and Kevin R. McKee. 2024. The illusion of artificial inclusion. InProceedings of the CHI Conference on Human Factors in Computing Systems. 1–12. doi:10.1145/3613904.3642703 arXiv:2401.08572 [cs]

  2. [2]

    Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2025. Playing repeated games with Large Language Models.Nature Human Behaviour(May 2025). doi:10.1038/s41562-025-02172-y arXiv:2305.16867 [cs]

  3. [3]

    Altera AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, Peter Y. Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang. 2024. Project Sid: Many-agent simulations toward AI civilization. doi:10.48550/arXiv.2411.00114 arXiv:2411.00114 [cs]

  4. [4]

    Jacy Reese Anthis, Ryan Liu, Sean M. Richardson, Austin C. Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, and Michael Bernstein. 2025. LLM Social Simulations Are a Promising Research Method. doi:10.48550/arXiv.2504.02234 arXiv:2504.02234 [cs]

  5. [5]

    Eckhart Arnold. 2013. Simulation Models of the Evolution of Cooperation as Proofs of Logical Possibilities. How Useful Are They?Etica E Politica15, 2 (2013), 101–138. https://philarchive.org/rec/ARNSMO Publisher: University of Trieste, Department of Philosophy

  6. [6]

    Eckhart Arnold. 2015. How Models Fail: A Critical Look at the History of Computer Simulations of the Evolution of Cooperation. In Collective Agency and Cooperation in Natural and Artificial Systems, Catrin Misselhorn (Ed.). Springer International Publishing, Cham, 261–279. doi:10.1007/978-3-319-15515-9_14

  7. [7]

    Robert Axelrod. [n. d.]. The Evolution of Cooperation*. ([n. d.]). https://ee.stanford.edu/~hellman/Breakthrough/book/pdfs/axelrod.pdf

  8. [8]

    N. Emrah Aydinonat. 2024. The puzzle of model-based explanation. InThe Routledge Handbook of Philosophy of Scientific Modeling(1 ed.). Routledge, London, 177–192. doi:10.4324/9781003205647-16

  9. [9]

    Paul Bartha. 2024. Analogy and Analogical Reasoning. InThe Stanford Encyclopedia of Philosophy(fall 2024 ed.), Edward N. Zalta and Uri Nodelman (Eds.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2024/entries/reasoning-analogy/

  10. [10]

    James Bogen and James Woodward. 1988. Saving the Phenomena.The Philosophical Review97, 3 (1988), 303–352. jstor:2185445 doi:10.2307/2185445

  11. [11]

    Alisa Bokulich. 2014. How the Tiger Bush Got its Stripes: ‘How Possibly’ vs. ‘How Actually’ Model Explanations.The Monist97, 3 (July 2014), 321–338. doi:10.5840/monist201497321

  12. [12]

    Robert N. Brandon. 2014. Adaptation and Environment. Princeton University Press, Princeton. doi:10.1515/9781400860661

  13. [13]

    P. W. (Percy Williams) Bridgman. 1927.The Logic of Modern Physics. The Macmillan Company

  14. [14]

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona Vectors: Monitoring and Controlling Character Traits in Language Models. doi:10.48550/arXiv.2507.21509 arXiv:2507.21509 [cs]. 2https://asta.allen.ai/chat FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Zhao et al

  15. [15]

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. doi:10.48550/arXiv.2308.10848 arXiv:2308.10848 [cs]

  16. [16]

    Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, and Arjun Yadav. 2024. GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents. doi:10.48550/arXiv.2406.06613 arXiv:2406.06613 [cs] version: 1

  17. [17]

    Carl Craver, James Tabery, and Phyllis Illari. 2024. Mechanisms in Science. InThe Stanford Encyclopedia of Philosophy(fall 2024 ed.), Edward N. Zalta and Uri Nodelman (Eds.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2024/ entries/science-mechanisms/

  18. [18]

    Carl F. Craver. 2006. When mechanistic models explain.Synthese153, 3 (Dec. 2006), 355–376. doi:10.1007/s11229-006-9097-x

  19. [19]

    Carl F. Craver. 2009.Explaining the Brain. Oxford University Press

  20. [20]

    Lee J. Cronbach and Paul E. Meehl. 1955. Construct validity in psychological tests.Psychological Bulletin52, 4 (1955), 281–302. doi:10.1037/h0040957 Place: US Publisher: American Psychological Association

  21. [21]

    Pierre Maurice Marie Duhem. 1954.The aim and structure of physical theory. Vol. 1. Princeton University Press. Pages: 85-87

  22. [22]

    Frances Egan. 2025.Deflating Mental Representation (The Jean Nicod Lectures). MIT Press (open access)

  23. [23]

    Catherine Z. Elgin. 2004. True Enough.Philosophical Issues14 (2004), 113–131. https://www.jstor.org/stable/3050623 Publisher: [Wiley, Ridgeview Publishing Company]

  24. [24]

    Joshua M. Epstein. 2006.Generative Social Science: Studies in Agent-Based Computational Modeling(stu - student edition ed.). Princeton University Press. http://www.jstor.org/stable/j.ctt7rxj1

  25. [25]

    Ronald Aylmer Fisher. 1999.The genetical theory of natural selection: by R.A. Fisher ; edited with a foreword and notes by J.H. Bennett(a complete variorum ed ed.). Oxford University Press, Oxford

  26. [26]

    Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. 2025.S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents. arXiv:2307.14984 [cs] doi:10.48550/arXiv.2307.14984

  27. [27]

    Martin Gardner. 1970. Mathematical Games.Scientific American223, 4 (1970), 120–123. https://www.jstor.org/stable/24927642 Publisher: Scientific American, a division of Nature America, Inc

  28. [28]

    Edward G. Carmines and Richard A. Zeller. 1979. Reliability and Validity Assessment. SAGE Publications, Inc. doi:10.4135/9781412985642

  29. [29]

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford

  30. [30]

    Datasheets for Datasets. doi:10.48550/arXiv.1803.09010 arXiv:1803.09010 [cs]

  31. [31]

    Stuart Glennan. 2017.The New Mechanical Philosophy. Oxford University Press, Oxford

  32. [32]

    Stuart S. Glennan. 1996. Mechanisms and the nature of causation.Erkenntnis44, 1 (Jan. 1996), 49–71. doi:10.1007/BF00172853

  33. [33]

    Claudius Graebner. 2018. How to Relate Models to Reality? An Epistemological Framework for the Validation and Verification of Computational Models.Journal of Artificial Societies and Social Simulation21, 3 (2018), 8

  34. [34]

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakan...

  35. [35]

    Olivia Guest and Iris van Rooij. 2025. Critical Artificial Intelligence Literacy for Psychologists. doi:10.31234/osf.io/dkrgj_v1

  36. [36]

    Fulin Guo. 2023. GPT in Game Theory Experiments. doi:10.48550/arXiv.2305.05516 arXiv:2305.05516 [econ]

  37. [37]

    John J. Horton. 2023. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? doi:10.48550/ arXiv.2301.07543 arXiv:2301.07543 [econ]

  38. [38]

    Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. 2024. War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars. doi:10.48550/arXiv.2311.17227 arXiv:2311.17227 [cs]

  39. [39]

    Brandon Jackson, B Cavello, Flynn Devine, Nick Garcia, Samuel J. Klein, Alex Krasodomski, Joshua Tan, and Eleanor Tursman. 2024. Public AI: Infrastructure for the common good. doi:10.5281/zenodo.13914560

  40. [40]

    Frank Jackson. 1982. Epiphenomenal Qualia.The Philosophical Quarterly32, 127 (April 1982), 127–136. doi:10.2307/2960077

  41. [41]

    Zhao Kaiya, Michelangelo Naim, Jovana Kondic, Manuel Cortes, Jiaxin Ge, Shuying Luo, Guangyu Robert Yang, and Andrew Ahn. 2023. Lyfe Agents: Generative Agents for Low-Cost Real-Time Social Interactions. arXiv:2310.02172 [cs] doi:10.48550/arXiv.2310.02172

  42. [42]

    David Michael Kaplan and Carl F. Craver. 2011. The Explanatory Force of Dynamical and Mathematical Models in Neuroscience: A Mechanistic Perspective*.Philosophy of Science78, 4 (2011), 601–627. doi:10.1086/661755 Publisher: [The University of Chicago Press, Philosophy of Science Association]

  43. [43]

    Kendrick N. Kay. 2018. Principles for models of neural information processing. NeuroImage 180 (Oct. 2018), 101–109. doi:10.1016/j.neuroimage.2017.08.016

  44. [44]

    Benjamin Kempinski, Ian Gemp, Kate Larson, Marc Lanctot, Yoram Bachrach, and Tal Kachman. 2025. Game of Thoughts: Iterative Reasoning in Game-Theoretic Domains with Large Language Models. InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’25). International Foundation for Autonomous Agents and Multiagent...

  45. [45]

    J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data.Biometrics33, 1 (March 1977), 159–174

  46. [46]

    Maik Larooij and Petter Törnberg. 2025. Do Large Language Models Solve the Problems of Agent-Based Modeling? A Critical Review of Generative Social Simulations. doi:10.48550/arXiv.2504.03274 arXiv:2504.03274 [cs]

  47. [47]

    Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d’Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. 2021. Mind the Gap: Assessing Temporal Generalization in Neural Language Models. doi:10.48550/arXiv.2102.01951 arXiv:2102.01951 [cs]

  48. [48]

    Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, and Katia Sycara. 2023. Theory of Mind for Multi-Agent Collaboration via Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 180–192. doi:10.18653/v1/2023.emnlp-main.13 arXiv:2310.10701 [cs]

  49. [49]

    Xinyi Li, Yu Xu, Yongfeng Zhang, and Edward C. Malthouse. 2024. Large Language Model-driven Multi-Agent Simulation for News Diffusion Under Different Network Structures. doi:10.48550/arXiv.2410.13909 arXiv:2410.13909 [cs]

  50. [50]

    Yuxuan Li, Sauvik Das, and Hirokazu Shirado. 2025. What Makes LLM Agent Simulations Useful for Policy? Insights From an Iterative Design Engagement in Emergency Preparedness. doi:10.48550/arXiv.2509.21868 arXiv:2509.21868 [cs]

  51. [51]

    Yuxuan Li and Hirokazu Shirado. 2025. Spontaneous Giving and Calculated Greed in Language Models. doi:10.48550/arXiv.2502.17720 arXiv:2502.17720 [cs]

  52. [52]

    Yuhan Liu, Xiuying Chen, Xiaoqing Zhang, Xing Gao, Ji Zhang, and Rui Yan. 2024. From Skepticism to Acceptance: Simulating the Attitude Dynamics Toward Fake News. InProceedings of the Thirty-ThirdInternational Joint Conference on Artificial Intelligence. 7849–7857. doi:10.24963/ijcai.2024/873 arXiv:2403.09498 [cs]

  53. [53]

    Kenneth MacCorquodale and Paul E. Meehl. 1948. On a Distinction between Hypothetical Constructs and Intervening Variables. Psychological Review55, 2 (1948), 95–107. doi:10.1037/h0056029

  54. [54]

    Peter Machamer, Lindley Darden, and Carl F. Craver. 2000. Thinking about Mechanisms.Philosophy of Science67, 1 (2000), 1–25. https://www.jstor.org/stable/188611 Publisher: [The University of Chicago Press, Philosophy of Science Association]

  55. [55]

    Giordano De Marzo, Luciano Pietronero, and David Garcia. 2023. Emergence of Scale-Free Networks in Social Interactions among Large Language Models. doi:10.48550/arXiv.2312.06619 arXiv:2312.06619 [physics]

  56. [56]

    Michela Massimi. 2022. Perspectival Ontology: Between Situated Knowledge and Multiculturalism.The Monist105, 2 (March 2022), 214–228. doi:10.1093/monist/onab032

  57. [57]

    Michael D. Mauk. 2000. The potential effectiveness of simulations versus phenomenological models.Nature Neuroscience3, 7 (July 2000), 649–651. doi:10.1038/76606 Publisher: Nature Publishing Group

  58. [58]

    James W. McAllister. 1997. Phenomena and Patterns in Data Sets.Erkenntnis (1975-)47, 2 (1997), 217–228. jstor:20012798 doi:10.1023/A: 1005387021520

  59. [59]

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency. 220–229. doi:10.1145/3287560.3287596 arXiv:1810.03993 [cs]

  60. [60]

    Mary S. Morgan and Margaret Morrison (Eds.). 1999.Models as Mediators: Perspectives on Natural and Social Science. Cambridge University Press, Cambridge. doi:10.1017/CBO9780511660108

  61. [61]

    Robert Northcott and Anna Alexandrova. 2015. Prisoner’s Dilemma Doesn’t Explain Much. InThe Prisoner?s Dilemma. Classic philosophical arguments., Martin Peterson (Ed.). Cambridge University Press, 64–84. https://philarchive.org/rec/NORPDD

  62. [62]

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. doi:10.48550/arXiv.2304.03442 arXiv:2304.03442 [cs]

  63. [63]

    Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2022. Social Simulacra: Creating Populated Prototypes for Social Computing Systems.Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology(Oct. 2022), 1–18. doi:10.1145/3526113.3545616 Conference Name: UIST ’22: The 3...

  64. [64]

    Wendy S. Parker. 2020. Model Evaluation: An Adequacy-for-Purpose View.Philosophy of Science87, 3 (July 2020), 457–477. doi:10.1086/ 708691

  65. [65]

    Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. 2024. Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning. arXiv. doi:10.48550/ARXIV.2402.13950 Version Number: 4

  66. [66]

    Judea Pearl. 2009.Causality(2 ed.). Cambridge University Press, Cambridge. doi:10.1017/CBO9780511803161

  67. [67]

    Judea Pearl and Dana Mackenzie. 2018.The book of why: The new science of cause and effect(1 ed.). Basic Books, Inc., USA

  68. [68]

    Axel Pichler and Nils Reiter. 2022. From Concepts to Texts and Back: Operationalization as a Core Activity of Digital Humanities. Journal of Cultural Analytics 7, 4 (Dec. 2022). doi:10.22148/001c.57195

  69. [69]

    Willard Van Orman Quine. 1953.From a Logical Point of View. Harvard University Press, Cambridge

  70. [70]

    Siyue Ren, Zhiyao Cui, Ruiqi Song, Zhen Wang, and Shuyue Hu. 2024. Emergence of Social Norms in Generative Agent Societies: Principles and Architecture. doi:10.48550/arXiv.2403.08251 arXiv:2403.08251 [cs]

  71. [71]

    Thomas C. Schelling. 1969. Models of Segregation.The American Economic Review59, 2 (1969), 488–493. https://www.jstor.org/stable/ 1823701 Publisher: American Economic Association

  72. [72]

    Galit Shmueli. 2010. To Explain or to Predict?Statist. Sci.25, 3 (Aug. 2010). doi:10.1214/10-STS330

  73. [73]

    Flaminio Squazzoni, J. Gareth Polhill, Bruce Edmonds, Petra Ahrweiler, Patrycja Antosz, Geeske Scholz, Emile Chappin, Melania Borit, Harko Verhagen, Francesca Giardini, and Nigel Gilbert. 2020. Computational Models That Matter During a Global Pandemic Outbreak: A Call to Action.JASSS - The Journal of Artificial Societies and Social Simulation23, 2 (March ...

  74. [74]

    S. S. Stevens. 1935. The Operational Definition of Psychological Concepts.Psychological Review42, 6 (1935), 517–527. doi:10.1037/h0056973

  75. [75]

    Samarth Swarup. 2019. Adequacy: What Makes a Simulation Good Enough?. In2019 Spring Simulation Conference (SpringSim). 1–12. doi:10.23919/SpringSim.2019.8732895

  76. [76]

    Edward Bradford Titchener. 1910.A Text-Book of Psychology. MacMillan Co, New York, NY, US. xx, 565 pages. doi:10.1037/10907-000

  77. [77]

    Loïs Vanhée, Melania Borit, Peer-Olaf Siebers, Roger Cremades, Christopher Frantz, Önder Gürcan, František Kalvas, Denisa Reshef Kera, Vivek Nallur, Kavin Narasimhan, and Martin Neumann. 2025. Large Language Models for Agent-Based Modelling: Current and possible uses across the modelling cycle. doi:10.48550/arXiv.2507.05723 arXiv:2507.05723 [cs] version: 1

  78. [78]

    Elina Vessonen. 2021. Conceptual engineering and operationalism in psychology.Synthese199, 3 (Dec. 2021), 10615–10637. doi:10.1007/ s11229-021-03261-x

  79. [79]

    Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. 2025. User Behavior Simulation with Large Language Model-based Agents.ACM Trans. Inf. Syst.43, 2 (Jan. 2025), 55:1–55:37. doi:10.1145/3708985

  80. [80]

    Zhilin Wang, Yu Ying Chiu, and Yu Cheung Chiu. 2023. Humanoid Agents: Platform for Simulating Human-like Generative Agents. doi:10.48550/arXiv.2310.05418 arXiv:2310.05418 [cs]

Showing first 80 references.