pith. machine review for the scientific record.

arxiv: 2605.12824 · v1 · submitted 2026-05-12 · 💻 cs.MA · cs.AI · cs.CL · cs.CY

Recognition: no theorem link

Mechanism Plausibility in Generative Agent-Based Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.CL · cs.CY
keywords LLM agent-based models · mechanism plausibility · philosophy of science · generative sufficiency · social simulations · explanatory models

The pith

A four-level scale separates whether LLM-based agent models reproduce social phenomena from whether they plausibly explain how those phenomena arise through mechanisms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper integrates work on large language model agent-based models with philosophy of science to define a notion of plausibility. It proposes a four-level scale, the Mechanism Plausibility Scale, that distinguishes a model's ability to generate a phenomenon from its ability to show the mechanisms that could produce it. This matters because the capability to reproduce behaviors is not the same as an explanation of how they occur. Modelers can use the scale to clarify whether their simulations are advancing prediction or explanation.

Core claim

By combining recent LLM-ABM research with contemporary philosophy of science literature on mechanisms, the authors operationalize plausibility as a four-level scale. This scale separates the evaluation of a model's generative sufficiency, meaning its ability to reproduce a phenomenon, from its mechanistic plausibility, meaning how the phenomenon could be produced by related organized entities and activities. It also clarifies the distinct roles of predictive models versus explanatory models in agent-based simulations.

What carries the argument

The Mechanism Plausibility Scale, a four-level operationalization that evaluates generative sufficiency separately from mechanistic plausibility based on philosophy of science concepts of mechanisms.
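To make the separation concrete, here is a minimal sketch, not from the paper: the level names and criteria below are assumptions inferred from the review's figure summaries (Levels 0–3, with Game of Life at Level 0 and the Artificial Anasazi Model at Level 3), encoded as a cumulative rubric.

```python
from enum import IntEnum

class MechanismPlausibility(IntEnum):
    # Level semantics are assumptions inferred from the review's figure
    # summaries, not the paper's own definitions.
    LEVEL_0 = 0  # no empirical target phenomenon claimed (e.g. Game of Life)
    LEVEL_1 = 1  # generative sufficiency only: output matches the phenomenon
    LEVEL_2 = 2  # how-possibly: organized entities/activities could produce it
    LEVEL_3 = 3  # how-actually: mechanism consistent with independent evidence

def rate(reproduces_target: bool,
         shows_mechanism: bool,
         mechanism_validated: bool) -> MechanismPlausibility:
    """Assign the highest level whose cumulative criteria are met."""
    level = MechanismPlausibility.LEVEL_0
    if reproduces_target:
        level = MechanismPlausibility.LEVEL_1
        if shows_mechanism:
            level = MechanismPlausibility.LEVEL_2
            if mechanism_validated:
                level = MechanismPlausibility.LEVEL_3
    return level
```

The rubric makes the core separation visible: a model can reach Level 1 (generative sufficiency) on reproduction alone, while Levels 2 and 3 (mechanistic plausibility) each demand additional mechanistic evidence.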

If this is right

  • Simulations can be classified according to whether they merely reproduce observed behaviors or demonstrate plausible mechanisms for producing them.
  • Predictive models focus on capability to match data, while explanatory models require evidence of mechanistic pathways.
  • Modelers gain a grounded framework to describe experiment characteristics and assess progress toward explanation.
  • Evaluation of LLM-generated behaviors in social simulations becomes more structured by separating reproduction from explanation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers in other generative modeling domains could adapt the scale to assess mechanistic claims.
  • The scale suggests new experimental designs that test for specific mechanisms in agent behaviors.
  • Adoption might encourage more interdisciplinary work between computational modelers and philosophers of science.

Load-bearing premise

That concepts of mechanisms from philosophy of science can be directly turned into a practical four-level scale for assessing LLM agent behaviors without additional validation or adaptation to specific domains.

What would settle it

A study where independent raters apply the four-level scale to the same set of LLM-ABM papers and find that their ratings do not align on the mechanistic plausibility levels, or where models rated high on the scale fail to predict new mechanistic interventions.
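The inter-rater half of that test could be scored with a standard chance-corrected agreement statistic such as Cohen's kappa (the Landis and Koch agreement benchmarks appear in the paper's reference list). A minimal sketch, with fabricated ratings on the 0–3 scale for illustration only:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters give the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical levels assigned to ten LLM-ABM papers by two independent raters.
a = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
b = [0, 1, 2, 2, 2, 1, 3, 1, 0, 3]
print(round(cohens_kappa(a, b), 3))  # → 0.589, "moderate" on the usual benchmarks
```

A study of this shape would bear on the claim directly: systematically low kappa on the mechanistic-plausibility levels would indicate the scale's criteria are not operational enough for independent raters to apply consistently.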

Figures

Figures reproduced from arXiv: 2605.12824 by David Huu Pham, Nicholas Vincent, Patrick Zhao.

Figure 1. An adapted Craver diagram [19] showing a simulation producing T with higher- and lower-level mechanisms. In the ABM, the agents/entities {x1, …, xm} (circles) and activities {φ1, …, φn} (arrows) work to produce T. The agents in the ABM are further and reciprocally constituted by lower-level mechanisms, which are generally abstracted away for the purposes of tractability, but are also why simula…
Figure 2. The Plausibility Scale classifies models based on their epistemic contribution.
Figure 3. The Mechanism Plausibility Scale in checklist form.
Figure 4. Example for Level 0: the Mechanism Plausibility Scale applied to an implementation of Conway's Game of Life.
Figure 5. Example for Level 1: the Mechanism Plausibility Scale applied to a fabricated game theory paper.
Figure 6. Example for Level 2: the Mechanism Plausibility Scale applied to Schelling's Model of Segregation [70].
Figure 7. Example for Level 3: the Mechanism Plausibility Scale applied to the Artificial Anasazi Model [24].
read the original abstract

Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recently, research has aim to test whether they are capable of generating different phenomena of interest, for example, human behavior on social media platforms or performance in game-theoretic scenarios. However, capability, prediction, and explanation are different -- drawing from the philosophy of science and mechanisms literature, \textit{explanation} requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of `plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes integrating recent work on LLM-based agent-based models (ABMs) with philosophy-of-science literature on mechanisms to operationalize a four-level Mechanism Plausibility Scale. The scale is intended to separate evaluation of generative sufficiency (ability to reproduce a target phenomenon) from mechanistic plausibility (how the phenomenon is produced by organized entities and activities), thereby clarifying the distinct roles of predictive versus explanatory models.

Significance. If the scale can be made operational with concrete criteria, it would offer a useful conceptual tool for researchers working on generative ABMs to ground claims about explanation rather than mere reproduction. The integration of mechanism concepts from philosophy of science addresses a recognized gap in evaluating black-box LLM agents, and the distinction between sufficiency and plausibility could help structure future model assessments in social simulation.

major comments (2)
  1. [Abstract / Scale definition] Abstract and the section defining the scale: no explicit criteria, level definitions, or mapping from philosophy-of-science mechanism concepts (organized entities and activities producing a phenomenon) to observable properties of LLM-generated text or agent behaviors are supplied. The central claim that the scale cleanly separates generative sufficiency from mechanistic plausibility therefore rests on unstated interpretive assumptions rather than operational rules.
  2. [Abstract] Abstract: the manuscript states that the scale will be applied to phenomena such as social-media behavior and game-theoretic scenarios, yet provides no worked example, test case, or illustration of how any level would be assigned to an existing LLM-ABM output. Without such demonstrations the proposal remains conceptual and its usability for modelers cannot be assessed.
minor comments (2)
  1. [Abstract] Abstract: 'research has aim to test' should read 'research has aimed to test'.
  2. [Abstract] Abstract: the phrase 'different phenomena of interest' is vague; a single concrete reference to one of the cited domains would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and for recognizing the potential value of integrating mechanism concepts from philosophy of science with LLM-based ABMs. We agree that the current manuscript is primarily conceptual and that explicit operational criteria plus a worked example are necessary to demonstrate usability. We address each major comment below and will incorporate the requested clarifications in a revised version.

read point-by-point responses
  1. Referee: [Abstract / Scale definition] Abstract and the section defining the scale: no explicit criteria, level definitions, or mapping from philosophy-of-science mechanism concepts (organized entities and activities producing a phenomenon) to observable properties of LLM-generated text or agent behaviors are supplied. The central claim that the scale cleanly separates generative sufficiency from mechanistic plausibility therefore rests on unstated interpretive assumptions rather than operational rules.

    Authors: We accept this assessment. The four-level scale is defined by reference to the degree of alignment with mechanistic explanation (organized entities and activities), but the manuscript does not yet supply concrete, observable criteria for assigning levels to LLM outputs. In the revision we will expand the scale-definition section to list explicit criteria for each level, including mappings such as: Level 0 covers models with no real-world target phenomenon; Level 1 requires only output matching the target phenomenon; Level 2 requires evidence of identifiable entities and activities that could plausibly produce it; Level 3 requires consistency with independently validated mechanisms. These criteria will be stated in terms of observable properties of generated text or behavior traces. revision: yes

  2. Referee: [Abstract] Abstract: the manuscript states that the scale will be applied to phenomena such as social-media behavior and game-theoretic scenarios, yet provides no worked example, test case, or illustration of how any level would be assigned to an existing LLM-ABM output. Without such demonstrations the proposal remains conceptual and its usability for modelers cannot be assessed.

    Authors: We agree that a concrete illustration is required to show how the scale functions in practice. Although the manuscript is a conceptual proposal, we will add a new subsection containing a worked example. Using a published LLM-ABM study on social-media posting behavior (or, alternatively, a game-theoretic coordination task), we will walk through the assignment of each level, citing specific output features that justify the rating and showing how the distinction between generative sufficiency and mechanistic plausibility is applied. revision: yes

Circularity Check

0 steps flagged

No circularity: Mechanism Plausibility Scale is an external operationalization, not a self-referential derivation.

full rationale

The paper proposes a four-level scale by integrating external philosophy-of-science literature on mechanisms with existing LLM-ABM work. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the scale definition to its own inputs by construction. The separation of generative sufficiency from mechanistic plausibility is presented as a conceptual framework drawn from cited sources rather than an internal fit or renaming. This is a standard non-circular proposal of a new evaluative tool.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the applicability of mechanism concepts from philosophy of science to LLM-generated agent behaviors and on the utility of a four-level ordinal scale for evaluation.

axioms (1)
  • domain assumption Explanation requires showing how a phenomenon is produced by related organized entities and activities
    Invoked in the abstract when contrasting capability, prediction, and explanation
invented entities (1)
  • Mechanism Plausibility Scale no independent evidence
    purpose: To rate LLM-ABMs on a four-level spectrum from generative sufficiency to mechanistic plausibility
    Newly proposed construct; no independent evidence supplied in the abstract

pith-pipeline@v0.9.0 · 5521 in / 1179 out tokens · 45276 ms · 2026-05-14T19:21:05.050772+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 66 canonical work pages · 3 internal anchors

  1. [1]

    William Agnew, A. Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, and Kevin R. McKee. 2024. The illusion of artificial inclusion. InProceedings of the CHI Conference on Human Factors in Computing Systems. 1–12. doi:10.1145/3613904.3642703 arXiv:2401.08572 [cs]

  2. [2]

    Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2025. Playing repeated games with Large Language Models.Nature Human Behaviour(May 2025). doi:10.1038/s41562-025-02172-y arXiv:2305.16867 [cs]

  3. [3]

    Altera AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, Peter Y. Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang. 2024. Project Sid: Many-agent simulations toward AI civilization. doi:10.48550/arXiv.2411.00114 arXiv:2411.00114 [cs]

  4. [4]

    Jacy Reese Anthis, Ryan Liu, Sean M. Richardson, Austin C. Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, and Michael Bernstein. 2025. LLM Social Simulations Are a Promising Research Method. doi:10.48550/arXiv.2504.02234 arXiv:2504.02234 [cs]

  5. [5]

    Eckhart Arnold. 2013. Simulation Models of the Evolution of Cooperation as Proofs of Logical Possibilities. How Useful Are They?Etica E Politica15, 2 (2013), 101–138. https://philarchive.org/rec/ARNSMO Publisher: University of Trieste, Department of Philosophy

  6. [6]

    Eckhart Arnold. 2015. How Models Fail: A Critical Look at the History of Computer Simulations of the Evolution of Cooperation. In Collective Agency and Cooperation in Natural and Artificial Systems, Catrin Misselhorn (Ed.). Springer International Publishing, Cham, 261–279. doi:10.1007/978-3-319-15515-9_14

  7. [7]

    Robert Axelrod. [n. d.]. The Evolution of Cooperation*. ([n. d.]). https://ee.stanford.edu/~hellman/Breakthrough/book/pdfs/axelrod.pdf

  8. [8]

    N. Emrah Aydinonat. 2024. The puzzle of model-based explanation. InThe Routledge Handbook of Philosophy of Scientific Modeling(1 ed.). Routledge, London, 177–192. doi:10.4324/9781003205647-16

  9. [9]

    Paul Bartha. 2024. Analogy and Analogical Reasoning. InThe Stanford Encyclopedia of Philosophy(fall 2024 ed.), Edward N. Zalta and Uri Nodelman (Eds.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2024/entries/reasoning-analogy/

  10. [10]

    James Bogen and James Woodward. 1988. Saving the Phenomena.The Philosophical Review97, 3 (1988), 303–352. jstor:2185445 doi:10.2307/2185445

  11. [11]

    Alisa Bokulich. 2014. How the Tiger Bush Got its Stripes: ‘How Possibly’ vs. ‘How Actually’ Model Explanations.The Monist97, 3 (July 2014), 321–338. doi:10.5840/monist201497321

  12. [12]

    Robert N. Brandon. 2014. Adaptation and Environment. Princeton University Press, Princeton. doi:10.1515/9781400860661

  13. [13]

    P. W. (Percy Williams) Bridgman. 1927.The Logic of Modern Physics. The Macmillan Company

  14. [14]

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona Vectors: Monitoring and Controlling Character Traits in Language Models. doi:10.48550/arXiv.2507.21509 arXiv:2507.21509 [cs]. 2https://asta.allen.ai/chat FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Zhao et al

  15. [15]

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. doi:10.48550/arXiv.2308.10848 arXiv:2308.10848 [cs]

  16. [16]

    Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, and Arjun Yadav. 2024. GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents. doi:10.48550/arXiv.2406.06613 arXiv:2406.06613 [cs] version: 1

  17. [17]

    Carl Craver, James Tabery, and Phyllis Illari. 2024. Mechanisms in Science. InThe Stanford Encyclopedia of Philosophy(fall 2024 ed.), Edward N. Zalta and Uri Nodelman (Eds.). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/fall2024/ entries/science-mechanisms/

  18. [18]

    Carl F. Craver. 2006. When mechanistic models explain.Synthese153, 3 (Dec. 2006), 355–376. doi:10.1007/s11229-006-9097-x

  19. [19]

    Carl F. Craver. 2009.Explaining the Brain. Oxford University Press

  20. [20]

    Lee J. Cronbach and Paul E. Meehl. 1955. Construct validity in psychological tests.Psychological Bulletin52, 4 (1955), 281–302. doi:10.1037/h0040957 Place: US Publisher: American Psychological Association

  21. [21]

    Pierre Maurice Marie Duhem. 1954.The aim and structure of physical theory. Vol. 1. Princeton University Press. Pages: 85-87

  22. [22]

    Frances Egan. 2025.Deflating Mental Representation (The Jean Nicod Lectures). MIT Press (open access)

  23. [23]

    Catherine Z. Elgin. 2004. True Enough.Philosophical Issues14 (2004), 113–131. https://www.jstor.org/stable/3050623 Publisher: [Wiley, Ridgeview Publishing Company]

  24. [24]

    Joshua M. Epstein. 2006.Generative Social Science: Studies in Agent-Based Computational Modeling(stu - student edition ed.). Princeton University Press. http://www.jstor.org/stable/j.ctt7rxj1

  25. [25]

    Ronald Aylmer Fisher. 1999.The genetical theory of natural selection: by R.A. Fisher ; edited with a foreword and notes by J.H. Bennett(a complete variorum ed ed.). Oxford University Press, Oxford

  26. [26]

    Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. 2025.S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents. arXiv:2307.14984 [cs] doi:10.48550/arXiv.2307.14984

  27. [27]

    Martin Gardner. 1970. Mathematical Games.Scientific American223, 4 (1970), 120–123. https://www.jstor.org/stable/24927642 Publisher: Scientific American, a division of Nature America, Inc

  28. [28]

    Edward G. Carmines and Richard A. Zeller. 1979. Reliability and Validity Assessment. SAGE Publications, Inc. doi:10.4135/9781412985642

  29. [29]

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford

  30. [30]

    Datasheets for Datasets. doi:10.48550/arXiv.1803.09010 arXiv:1803.09010 [cs]

  31. [31]

    Stuart Glennan. 2017.The New Mechanical Philosophy. Oxford University Press, Oxford

  32. [32]

    Stuart S. Glennan. 1996. Mechanisms and the nature of causation.Erkenntnis44, 1 (Jan. 1996), 49–71. doi:10.1007/BF00172853

  33. [33]

    Claudius Graebner. 2018. How to Relate Models to Reality? An Epistemological Framework for the Validation and Verification of Computational Models.Journal of Artificial Societies and Social Simulation21, 3 (2018), 8

  34. [34]

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakan...

  35. [35]

    Olivia Guest and Iris van Rooij. 2025. Critical Artificial Intelligence Literacy for Psychologists. doi:10.31234/osf.io/dkrgj_v1

  36. [36]

    Fulin Guo. 2023. GPT in Game Theory Experiments. doi:10.48550/arXiv.2305.05516 arXiv:2305.05516 [econ]

  37. [37]

    John J. Horton. 2023. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? doi:10.48550/ arXiv.2301.07543 arXiv:2301.07543 [econ]

  38. [38]

    Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. 2024. War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars. doi:10.48550/arXiv.2311.17227 arXiv:2311.17227 [cs]

  39. [39]

    Brandon Jackson, B Cavello, Flynn Devine, Nick Garcia, Samuel J. Klein, Alex Krasodomski, Joshua Tan, and Eleanor Tursman. 2024. Public AI: Infrastructure for the common good. doi:10.5281/zenodo.13914560

  40. [40]

    Frank Jackson. 1982. Epiphenomenal Qualia.The Philosophical Quarterly32, 127 (April 1982), 127–136. doi:10.2307/2960077

  41. [41]

    Zhao Kaiya, Michelangelo Naim, Jovana Kondic, Manuel Cortes, Jiaxin Ge, Shuying Luo, Guangyu Robert Yang, and Andrew Ahn. 2023. Lyfe Agents: Generative Agents for Low-Cost Real-Time Social Interactions. arXiv:2310.02172 [cs] doi:10.48550/arXiv.2310.02172

  42. [42]

    David Michael Kaplan and Carl F. Craver. 2011. The Explanatory Force of Dynamical and Mathematical Models in Neuroscience: A Mechanistic Perspective*.Philosophy of Science78, 4 (2011), 601–627. doi:10.1086/661755 Publisher: [The University of Chicago Press, Philosophy of Science Association]

  43. [43]

    Kendrick N. Kay. 2018. Principles for models of neural information processing. NeuroImage 180 (Oct. 2018), 101–109. doi:10.1016/j.neuroimage.2017.08.016

  44. [44]

    Benjamin Kempinski, Ian Gemp, Kate Larson, Marc Lanctot, Yoram Bachrach, and Tal Kachman. 2025. Game of Thoughts: Iterative Reasoning in Game-Theoretic Domains with Large Language Models. InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’25). International Foundation for Autonomous Agents and Multiagent...

  45. [45]

    J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data.Biometrics33, 1 (March 1977), 159–174

  46. [46]

    Maik Larooij and Petter Törnberg. 2025. Do Large Language Models Solve the Problems of Agent-Based Modeling? A Critical Review of Generative Social Simulations. doi:10.48550/arXiv.2504.03274 arXiv:2504.03274 [cs]

  47. [47]

    Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d’Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. 2021. Mind the Gap: Assessing Temporal Generalization in Neural Language Models. doi:10.48550/arXiv.2102.01951 arXiv:2102.01951 [cs]

  48. [48]

    Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, and Katia Sycara. 2023. Theory of Mind for Multi-Agent Collaboration via Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 180–192. doi:10.18653/v1/2023.emnlp-main.13 arXiv:2310.10701 [cs]

  49. [49]

    Xinyi Li, Yu Xu, Yongfeng Zhang, and Edward C. Malthouse. 2024. Large Language Model-driven Multi-Agent Simulation for News Diffusion Under Different Network Structures. doi:10.48550/arXiv.2410.13909 arXiv:2410.13909 [cs]

  50. [50]

    Yuxuan Li, Sauvik Das, and Hirokazu Shirado. 2025. What Makes LLM Agent Simulations Useful for Policy? Insights From an Iterative Design Engagement in Emergency Preparedness. doi:10.48550/arXiv.2509.21868 arXiv:2509.21868 [cs]

  51. [51]

    Yuxuan Li and Hirokazu Shirado. 2025. Spontaneous Giving and Calculated Greed in Language Models. doi:10.48550/arXiv.2502.17720 arXiv:2502.17720 [cs]

  52. [52]

    Yuhan Liu, Xiuying Chen, Xiaoqing Zhang, Xing Gao, Ji Zhang, and Rui Yan. 2024. From Skepticism to Acceptance: Simulating the Attitude Dynamics Toward Fake News. InProceedings of the Thirty-ThirdInternational Joint Conference on Artificial Intelligence. 7849–7857. doi:10.24963/ijcai.2024/873 arXiv:2403.09498 [cs]

  53. [53]

    Kenneth MacCorquodale and Paul E. Meehl. 1948. On a Distinction between Hypothetical Constructs and Intervening Variables. Psychological Review55, 2 (1948), 95–107. doi:10.1037/h0056029

  54. [54]

    Peter Machamer, Lindley Darden, and Carl F. Craver. 2000. Thinking about Mechanisms.Philosophy of Science67, 1 (2000), 1–25. https://www.jstor.org/stable/188611 Publisher: [The University of Chicago Press, Philosophy of Science Association]

  55. [55]

    Giordano De Marzo, Luciano Pietronero, and David Garcia. 2023. Emergence of Scale-Free Networks in Social Interactions among Large Language Models. doi:10.48550/arXiv.2312.06619 arXiv:2312.06619 [physics]

  56. [56]

    Michela Massimi. 2022. Perspectival Ontology: Between Situated Knowledge and Multiculturalism.The Monist105, 2 (March 2022), 214–228. doi:10.1093/monist/onab032

  57. [57]

    Michael D. Mauk. 2000. The potential effectiveness of simulations versus phenomenological models.Nature Neuroscience3, 7 (July 2000), 649–651. doi:10.1038/76606 Publisher: Nature Publishing Group

  58. [58]

    James W. McAllister. 1997. Phenomena and Patterns in Data Sets.Erkenntnis (1975-)47, 2 (1997), 217–228. jstor:20012798 doi:10.1023/A: 1005387021520

  59. [59]

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency. 220–229. doi:10.1145/3287560.3287596 arXiv:1810.03993 [cs]

  60. [60]

    Mary S. Morgan and Margaret Morrison (Eds.). 1999.Models as Mediators: Perspectives on Natural and Social Science. Cambridge University Press, Cambridge. doi:10.1017/CBO9780511660108

  61. [61]

    Robert Northcott and Anna Alexandrova. 2015. Prisoner’s Dilemma Doesn’t Explain Much. InThe Prisoner?s Dilemma. Classic philosophical arguments., Martin Peterson (Ed.). Cambridge University Press, 64–84. https://philarchive.org/rec/NORPDD

  62. [62]

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. doi:10.48550/arXiv.2304.03442 arXiv:2304.03442 [cs]

  63. [63]

    Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2022. Social Simulacra: Creating Populated Prototypes for Social Computing Systems.Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology(Oct. 2022), 1–18. doi:10.1145/3526113.3545616 Conference Name: UIST ’22: The 3...

  64. [64]

    Wendy S. Parker. 2020. Model Evaluation: An Adequacy-for-Purpose View.Philosophy of Science87, 3 (July 2020), 457–477. doi:10.1086/ 708691

  65. [65]

    Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. 2024. Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning. arXiv. doi:10.48550/ARXIV.2402.13950 Version Number: 4

  66. [66]

    Judea Pearl. 2009.Causality(2 ed.). Cambridge University Press, Cambridge. doi:10.1017/CBO9780511803161

  67. [67]

    Judea Pearl and Dana Mackenzie. 2018.The book of why: The new science of cause and effect(1 ed.). Basic Books, Inc., USA

  68. [68]

    Axel Pichler and Nils Reiter. 2022. From Concepts to Texts and Back: Operationalization as a Core Activity of Digital Humanities. Journal of Cultural Analytics 7, 4 (Dec. 2022). doi:10.22148/001c.57195

  69. [69]

    Willard Van Orman Quine. 1953.From a Logical Point of View. Harvard University Press, Cambridge

  70. [70]

    Siyue Ren, Zhiyao Cui, Ruiqi Song, Zhen Wang, and Shuyue Hu. 2024. Emergence of Social Norms in Generative Agent Societies: Principles and Architecture. doi:10.48550/arXiv.2403.08251 arXiv:2403.08251 [cs]

  71. [71]

    Thomas C. Schelling. 1969. Models of Segregation.The American Economic Review59, 2 (1969), 488–493. https://www.jstor.org/stable/ 1823701 Publisher: American Economic Association

  72. [72]

    Galit Shmueli. 2010. To Explain or to Predict?Statist. Sci.25, 3 (Aug. 2010). doi:10.1214/10-STS330

  73. [73]

    Flaminio Squazzoni, J. Gareth Polhill, Bruce Edmonds, Petra Ahrweiler, Patrycja Antosz, Geeske Scholz, Emile Chappin, Melania Borit, Harko Verhagen, Francesca Giardini, and Nigel Gilbert. 2020. Computational Models That Matter During a Global Pandemic Outbreak: A Call to Action.JASSS - The Journal of Artificial Societies and Social Simulation23, 2 (March ...

  74. [74]

    S. S. Stevens. 1935. The Operational Definition of Psychological Concepts.Psychological Review42, 6 (1935), 517–527. doi:10.1037/h0056973

  75. [75]

    Samarth Swarup. 2019. Adequacy: What Makes a Simulation Good Enough?. In2019 Spring Simulation Conference (SpringSim). 1–12. doi:10.23919/SpringSim.2019.8732895

  76. [76]

    Edward Bradford Titchener. 1910.A Text-Book of Psychology. MacMillan Co, New York, NY, US. xx, 565 pages. doi:10.1037/10907-000

  77. [77]

    Loïs Vanhée, Melania Borit, Peer-Olaf Siebers, Roger Cremades, Christopher Frantz, Önder Gürcan, František Kalvas, Denisa Reshef Kera, Vivek Nallur, Kavin Narasimhan, and Martin Neumann. 2025. Large Language Models for Agent-Based Modelling: Current and possible uses across the modelling cycle. doi:10.48550/arXiv.2507.05723 arXiv:2507.05723 [cs] version: 1

  78. [78]

    Elina Vessonen. 2021. Conceptual engineering and operationalism in psychology.Synthese199, 3 (Dec. 2021), 10615–10637. doi:10.1007/ s11229-021-03261-x

  79. [79]

    Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Hao Sun, Ruihua Song, Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. 2025. User Behavior Simulation with Large Language Model-based Agents.ACM Trans. Inf. Syst.43, 2 (Jan. 2025), 55:1–55:37. doi:10.1145/3708985

  80. [80]

    Zhilin Wang, Yu Ying Chiu, and Yu Cheung Chiu. 2023. Humanoid Agents: Platform for Simulating Human-like Generative Agents. doi:10.48550/arXiv.2310.05418 arXiv:2310.05418 [cs]

Showing first 80 references.