pith. machine review for the scientific record.

arxiv: 2604.04157 · v1 · submitted 2026-04-05 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents


Pith reviewed 2026-05-13 16:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords: theory of mind · LLM agents · emergent behavior · memory · opponent modeling · poker · deception · social cognition

The pith

LLM poker agents develop theory-of-mind-like opponent modeling only when given persistent memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether theory-of-mind reasoning can arise in large language models through extended interaction rather than static tests. In repeated Texas Hold'em sessions, agents equipped with memory build predictive and recursive models of opponents, reaching ToM levels 3-5, while agents lacking memory stay at level 0. Memory also enables strategic deception that exploits specific opponents, and the resulting mental models are expressed in readable natural language. Domain knowledge improves how precisely the models are used but does not create the capacity itself. The work therefore shows that functional social cognition can emerge from interaction dynamics alone.

Core claim

Memory is both necessary and sufficient for the emergence of ToM-like behavior in LLM agents. In a 2x2 design, agents with persistent memory reach levels 3-5 (predictive to recursive modeling of opponents) across replications, while agents without memory remain at level 0. Strategic deception grounded in those models appears only in the memory condition, and the models themselves are directly readable as natural-language statements.
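The perfect separation described here is what the abstract summarizes as Cliff's delta = 1.0. As a minimal sketch of how that effect size falls out of pairwise comparisons — with hypothetical per-replication ToM levels, not the paper's raw data:

```python
# Cliff's delta: P(x > y) - P(x < y) over all cross-condition pairs.
# A value of +1 or -1 means perfect separation between the two samples.

def cliffs_delta(xs, ys):
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Hypothetical maximum ToM levels per replication (five replications
# per condition, mirroring the design but not the actual results).
memory_levels = [3, 4, 5, 4, 3]
no_memory_levels = [0, 0, 0, 0, 0]

print(cliffs_delta(memory_levels, no_memory_levels))  # → 1.0
```

Because every memory-condition value exceeds every no-memory value, the statistic saturates at 1.0; any overlap at all between the conditions would pull it below that ceiling.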

What carries the argument

The ToM level classification (0-5) applied to agent actions and statements, which tracks the shift from no opponent modeling to predictive, recursive, and deceptive use of opponent mental states during extended poker play.

If this is right

  • Agents with ToM deviate from game-theoretic optimality to exploit specific opponents, matching how expert humans play.
  • Domain expertise is not required for ToM emergence but raises the precision of deception once models exist.
  • All mental models remain directly readable in natural language, offering a transparent record of the agent's social reasoning.
  • The assessment pattern holds across model families, as shown by near-perfect agreement (κ = 0.81) between the primary coder and GPT-4o evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The finding suggests that persistent memory may be a general prerequisite for any interactive social intelligence in current LLMs.
  • If memory is the critical enabler, then scaling context windows or external memory stores could be a direct route to more human-like social behavior.
  • The readable natural-language models open the possibility of inspecting and editing an agent's social assumptions in real time.

Load-bearing premise

The hand-coded classification of actions and statements into ToM levels 0–5 isolates genuine opponent mental-state modeling rather than surface patterns or prompt effects.

What would settle it

Running the identical poker sessions with memory disabled but with an external opponent-modeling module added, then checking whether ToM levels 3-5 and deception still appear.

Figures

Figures reproduced from arXiv: 2604.04157 by Hsieh-Ting Lin, Tsung-Yu Hou.

Figure 1. Real-time spectator interface during a Full condition session (Hand 3, Turn). The interface displays each agent’s cards, chips, and position around the poker table. The Theory of Mind panel (top left) shows real-time ToM level assessments for each agent. Agent memory notes appear as speech bubbles, illustrating the natural-language opponent models that constitute “readable minds.” Statistics panel (left) …
Figure 2. ToM level trajectories across hands. Mean maximum ToM level per hand bin, averaged across five replications. Memory-equipped conditions (Full, No-Skill) show progressive development from Level 0 to Level 3–5, while memory-absent conditions (No-Memory, Baseline) remain at Level 0 throughout. Shaded regions indicate ±1 SD across replications.
Figure 4. Cross-model inter-rater reliability confusion matrix. Comparison of ToM level codes assigned by Claude Sonnet (rows) and GPT-4o (columns) on a stratified random sample of memory snapshots. The concentration of values on the main diagonal reflects near-perfect agreement (κ = 0.81). All disagreements are between adjacent levels, confirming systematic consistency in the coding rubric.
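The κ = 0.81 in Figure 4 is a weighted Cohen's kappa, which corrects raw agreement for chance and penalizes disagreements in proportion to their distance in levels. A minimal sketch with linear weights, on an illustrative confusion matrix rather than the paper's actual counts:

```python
# Linearly weighted Cohen's kappa from an inter-rater confusion matrix:
# kappa_w = 1 - (observed weighted disagreement / chance-expected one),
# with weight |i - j| between the two raters' assigned levels i and j.

def weighted_kappa(matrix):
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_m = [sum(row) for row in matrix]                             # rater A marginals
    col_m = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # rater B marginals
    observed = sum(abs(i - j) * matrix[i][j]
                   for i in range(n) for j in range(n))
    expected = sum(abs(i - j) * row_m[i] * col_m[j] / total
                   for i in range(n) for j in range(n))
    return 1 - observed / expected

# Illustrative 3-level matrix: heavy diagonal, only adjacent-level misses.
m = [[20, 2, 0],
     [1, 15, 2],
     [0, 1, 9]]
print(round(weighted_kappa(m), 3))  # → 0.854
```

Adjacent-level disagreements, as in Figure 4, carry the minimum nonzero weight, which is why a confusion matrix can show some off-diagonal mass yet still land in the "almost perfect" band.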
Original abstract

Theory of Mind (ToM) -- the ability to model others' mental states -- is fundamental to human social cognition. Whether large language models (LLMs) can develop ToM has been tested exclusively through static vignettes, leaving open whether ToM-like reasoning can emerge through dynamic interaction. Here we report that autonomous LLM agents playing extended sessions of Texas Hold'em poker progressively develop sophisticated opponent models, but only when equipped with persistent memory. In a 2x2 factorial design crossing memory (present/absent) with domain knowledge (present/absent), each with five replications (N = 20 experiments, ~6,000 agent-hand observations), we find that memory is both necessary and sufficient for ToM-like behavior emergence (Cliff's delta = 1.0, p = 0.008). Agents with memory reach ToM Level 3-5 (predictive to recursive modeling), while agents without memory remain at Level 0 across all replications. Strategic deception grounded in opponent models occurs exclusively in memory-equipped conditions (Fisher's exact p < 0.001). Domain expertise does not gate ToM-like behavior emergence but enhances its application: agents without poker knowledge develop equivalent ToM levels but less precise deception (p = 0.004). Agents with ToM deviate from game-theoretically optimal play (67% vs. 79% TAG adherence, delta = -1.0, p = 0.008) to exploit specific opponents, mirroring expert human play. All mental models are expressed in natural language and directly readable, providing a transparent window into AI social cognition. Cross-model validation with GPT-4o yields weighted Cohen's kappa = 0.81 (almost perfect agreement). These findings demonstrate that functional ToM-like behavior can emerge from interaction dynamics alone, without explicit training or prompting, with implications for understanding artificial social intelligence and biological social cognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that LLM agents playing extended Texas Hold'em poker develop ToM-like behaviors (reaching Levels 3-5 with predictive and recursive opponent modeling) exclusively when equipped with persistent memory. In a 2x2 factorial design (memory present/absent crossed with domain knowledge present/absent) across 20 experiments (five replications per cell) and ~6,000 observations, memory is reported as both necessary and sufficient (Cliff's delta=1.0, p=0.008), with strategic deception occurring only in memory conditions (Fisher's exact p<0.001) and all mental models expressed in readable natural language.

Significance. If the classification of ToM levels is valid, the work demonstrates that functional social cognition can emerge in LLMs purely from interaction dynamics and persistent context, without explicit ToM training or prompting. The readable natural-language models provide a rare transparent window into AI reasoning, with implications for artificial social intelligence and comparisons to biological ToM.

major comments (1)
  1. [Methods: ToM Level Classification] Methods section on ToM level rubric and deception detection: The headline result (memory necessary and sufficient for Levels 3-5, perfect separation with Cliff's delta=1.0) depends on the rubric accurately isolating genuine recursive opponent mental-state modeling rather than surface statistical patterns or prompt leakage enabled by memory-rich transcripts. The cross-model kappa=0.81 measures consistency between classifiers but not external validity against behavioral outcomes such as improved action prediction or exploitation success beyond baseline poker heuristics. An independent validation (e.g., correlation between attributed models and actual predictive accuracy in held-out hands) is required to confirm the levels reflect causal modeling.
minor comments (2)
  1. [Abstract] Abstract: The parenthetical '(predictive to recursive modeling)' for Levels 3-5 is helpful but would benefit from a one-sentence definition or reference to the exact rubric criteria used for classification.
  2. [Results] Results: The 67% vs. 79% TAG adherence comparison (delta=-1.0) should specify the exact test statistic and whether it accounts for within-agent dependence across hands.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concern about external validation of the ToM level rubric is well-taken, and we address it directly below while clarifying how our existing behavioral results already constrain alternative interpretations such as prompt leakage.

Point-by-point responses
  1. Referee: Methods section on ToM level rubric and deception detection: The headline result (memory necessary and sufficient for Levels 3-5, perfect separation with Cliff's delta=1.0) depends on the rubric accurately isolating genuine recursive opponent mental-state modeling rather than surface statistical patterns or prompt leakage enabled by memory-rich transcripts. The cross-model kappa=0.81 measures consistency between classifiers but not external validity against behavioral outcomes such as improved action prediction or exploitation success beyond baseline poker heuristics. An independent validation (e.g., correlation between attributed models and actual predictive accuracy in held-out hands) is required to confirm the levels reflect causal modeling.

    Authors: We agree that inter-rater reliability alone does not establish external validity and that a direct link to behavioral prediction would strengthen the interpretation. The rubric follows the standard five-level developmental ToM hierarchy (Level 0: no mental-state attribution; Level 3: predictive modeling of beliefs; Level 5: recursive embedding of opponent models), applied to the agents' natural-language reasoning traces. While we did not report a held-out prediction correlation in the original submission, the 2x2 design already provides strong behavioral dissociation: only memory-equipped agents reach Levels 3-5, and only those agents exhibit strategic deception (Fisher's exact p<0.001) and systematic deviation from game-theoretic optimal play to exploit specific opponents (67% vs. 79% TAG adherence, Cliff's delta=-1.0, p=0.008). Non-memory agents, which receive identical prompts and domain knowledge but lack persistent context, remain at Level 0 and show neither deception nor exploitation. This pattern is difficult to reconcile with simple prompt leakage or surface statistics, as the memory-absent condition controls for transcript length and prompt content. In the revised manuscript we will add a post-hoc analysis correlating each agent's attributed ToM level with its empirical accuracy at predicting opponent actions on held-out hands within the same sessions, thereby providing the requested external validation metric.

    Revision: yes
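The post-hoc analysis promised here reduces to a rank correlation between attributed ToM level and held-out predictive accuracy. A sketch of that computation (Spearman's ρ with tie-aware ranks), on hypothetical per-agent values rather than data from the paper:

```python
# Spearman's rho: Pearson correlation computed on (average) ranks.

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of ties
        avg = (i + j) / 2 + 1           # average 1-based rank of the run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-agent values: attributed ToM level and accuracy at
# predicting opponent actions on held-out hands.
tom_levels = [0, 0, 3, 4, 5, 3]
pred_acc = [0.48, 0.52, 0.61, 0.66, 0.71, 0.58]
print(round(spearman(tom_levels, pred_acc), 3))  # → 0.971
```

A strongly positive ρ on held-out hands would be the external-validity evidence the referee asks for; a near-zero ρ would suggest the levels track transcript features rather than predictive opponent models.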

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of memory conditions

Full rationale

The paper reports results from a 2x2 factorial experiment with LLM poker agents across memory and domain-knowledge conditions. The central claim (memory necessary and sufficient for ToM Levels 3-5) is established by direct observation of agent transcripts, hand-coded or prompted level classification, and non-parametric statistics (Cliff's delta, Fisher's exact tests) on ~6000 observations. No equations, derivations, fitted parameters, or self-citations are invoked to define or predict the outcome; the result is an empirical separation between conditions rather than a tautological reduction. The ToM rubric itself is an external measurement tool applied to generated text and does not presuppose the memory effect.
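The Fisher's exact tests mentioned above treat deception as a binary outcome per session, crossed against memory condition in a 2x2 table. A self-contained sketch of the two-sided test via the hypergeometric distribution, using hypothetical session counts rather than the paper's:

```python
from math import comb

# Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
# sum the probabilities of all tables sharing the observed margins
# that are no more likely than the observed table.

def fisher_exact_2x2(a, b, c, d):
    n = a + b + c + d
    def p(k):  # hypergeometric probability of top-left cell = k
        return comb(a + b, k) * comb(c + d, (a + c) - k) / comb(n, a + c)
    p_obs = p(a)
    lo, hi = max(0, a - d), min(a + b, a + c)
    return sum(p(k) for k in range(lo, hi + 1) if p(k) <= p_obs + 1e-12)

# Hypothetical counts: sessions with/without observed deception in the
# memory vs. no-memory conditions (10 sessions each, perfect split).
print(fisher_exact_2x2(10, 0, 0, 10))  # ≈ 1.1e-05, i.e. p < 0.001
```

With a perfect split the p-value is bounded only by the number of sessions, which is why even on the order of ten sessions per condition can clear the p < 0.001 threshold the paper reports.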

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that observed natural-language opponent descriptions and action deviations constitute ToM-like modeling; no new entities or free parameters are introduced in the abstract.

axioms (1)
  • Domain assumption: LLM agents can maintain and use persistent memory across independent game sessions without external scaffolding.
    Invoked to explain why the memory condition produces Level 3–5 behavior while the no-memory condition stays at Level 0.

pith-pipeline@v0.9.0 · 5649 in / 1124 out tokens · 26685 ms · 2026-05-13T16:51:02.604101+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

