EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

Dylan Lu; Edward Gunn; Gaowen Liu; Gurusha Juneja; Jayanth Srinivasa; Parth Diwane; Saaket Agashe; William Yang Wang; Xin Eric Wang; Yali Du

arxiv: 2605.09826 · v2 · pith:Q5CGYBZOnew · submitted 2026-05-11 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-20 23:24 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords Theory of Mind · Embodied AI · Multi-agent systems · AI benchmarks · Functional ToM · Partial observability · Epistemic reasoning

The pith

Frontier models achieve 0% on functional Theory of Mind tasks in embodied settings despite 45% on literal belief questions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EnactToM, an evolving benchmark of 300 multi-agent tasks in a 3D household with partial observability, private information, and constrained communication. On the hard split, all seven tested frontier models score zero percent Pass^3 on functional tasks that require acting optimally on implicit beliefs. The same models average 45 percent accuracy on direct literal belief probes, revealing a clear gap between verbalizing beliefs and using them for action. This matters because effective collaboration in shared physical spaces depends on that functional capacity rather than on answering explicit queries.

Core claim

EnactToM reports that the seven evaluated frontier models fail to complete embodied collaborative tasks that depend on tracking and acting upon partners' private information: they score zero percent Pass^3 on the hard split, in contrast to an average of 45 percent accuracy on literal belief probes. Manual analysis attributes 93 percent of sampled failures to epistemic coordination breakdowns such as withheld information and ignored partner constraints.
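The headline metric can be made concrete with a small sketch. This page does not define Pass^3, so the code below assumes a common reading of Pass^k: a task counts as solved only if all k independent trials succeed. The function name `pass_hat_k`, the task identifiers, and the outcomes are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL succeed (assumed reading of Pass^k)."""
    if not trials:
        raise ValueError("no tasks given")
    solved = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A single failed trial is enough to mark the task as unsolved.
        if all(outcomes[:k]):
            solved += 1
    return solved / len(trials)


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}
print(pass_hat_k(example))  # 0.5
```

Under this reading the metric is stricter than average success rate: `task_b` succeeds in two of three trials but still counts as unsolved, which is one reason a 0.0% score can coexist with occasional single-trial successes.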

What carries the argument

EnactToM benchmark of formally verified embodied multi-agent tasks that isolate functional Theory of Mind by requiring optimal action based on inferred epistemic states under partial observability and constrained communication.

If this is right

Literal belief questions alone do not measure the ability to act on inferred knowledge in collaborative physical settings.
Epistemic coordination failures such as withheld information and misallocated messages must be addressed to enable functional Theory of Mind.
The evolving nature of the benchmark allows it to maintain difficulty as models improve on current tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training focused on direct question answering may not produce the internal tracking needed for acting on beliefs in dynamic environments.
The same coordination breakdowns could appear in other multi-agent domains such as robotic teams or virtual assistants.

Load-bearing premise

The 3D household setup with partial observability and constrained communication accurately captures the epistemic demands that require agents to act on implicit beliefs rather than other solvable strategies.

What would settle it

A model achieving greater than 20 percent Pass^3 on the hard split of EnactToM tasks by correctly inferring and using private information without explicit messages would falsify the observed performance gap.

Original abstract

Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EnactToM reports a clean zero on functional task completion for frontier models in embodied settings while they manage 45% on literal belief questions, but the tasks may not force epistemic coordination over other failure modes.

The main point for you is that EnactToM reports zero success for seven frontier models on functional task completion in hard embodied scenarios, against 45 percent on literal belief questions. This suggests current agents lack the ability to act on implicit knowledge in collaborative settings.

The paper introduces tasks that require agents to complete household goals while tracking what partners know and can do, under partial views and limited messages. The authors verify each task formally for solvability and for the level of epistemic reasoning needed, then analyze failures and link most of them to coordination problems such as withholding information or ignoring constraints. The evolving aspect, where new tasks ramp up difficulty, is a practical addition for keeping the benchmark relevant.

This approach improves on earlier ToM tests by focusing on action rather than answers, which fits better with robotics and multi-agent work. The concrete failure modes the authors identify give clear directions for fixes.

The softer area is the link between the zero scores and functional ToM specifically. The verification shows that solutions exist with full belief tracking, yet it does not demonstrate that all successful policies must use that tracking. Other factors such as low-level control or communication handling could contribute to the failures. The sampled manual analysis helps but leaves some uncertainty about whether the benchmark cleanly measures the target skill.

This paper suits researchers building embodied agents for collaboration. Readers interested in benchmarks that test real-world coordination will find the tasks and results worth examining. It has enough structure and evidence to go to serious referees, who can push on the isolation of the capability. I recommend sending it for peer review, with the expectation that revisions will address how the tasks enforce the need for epistemic coordination.

Referee Report

2 major / 2 minor

Summary. The paper introduces EnactToM, an evolving benchmark consisting of 300 embodied multi-agent tasks in a 3D household environment with partial observability, private information, and constrained communication. It evaluates seven frontier models on functional Theory of Mind (acting optimally on implicit beliefs) versus literal belief probes, reporting 0.0% Pass^3 on functional task completion in the hard split versus 45.0% on literal probes. Each task is formally verified for solvability and required epistemic depth; new tasks are generated to increase difficulty as models improve. Manual analysis attributes 93% of sampled failures to epistemic coordination breakdowns.

Significance. If the verification procedures establish that tasks genuinely require functional ToM rather than being solvable by other means, the benchmark supplies a reproducible, falsifiable target for embodied multi-agent systems and highlights a measurable gap between literal and functional ToM performance in current models. The evolving generation mechanism and explicit failure categorization are positive features that could support iterative progress tracking.

major comments (2)

[Benchmark Construction] Benchmark Construction section (or equivalent): The formal verification procedure is described as confirming 'solvability and required epistemic depth,' yet it is not shown that every successful policy must perform epistemic coordination (e.g., via exhaustive search over information partitions or proof that non-epistemic policies fail). Without this necessity argument, the 0.0% Pass^3 result on the hard split could reflect failures in low-level planning, message parsing, or embodiment constraints instead of functional ToM, undermining the central claim that the benchmark isolates the intended capability.
[Evaluation and Results] Evaluation and Results section: The 93% figure from manual failure analysis is post-hoc and based on a sampled subset; the paper should report the sample size, selection criteria, and inter-annotator agreement to establish that the attribution to epistemic coordination is representative rather than anecdotal.
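The necessity argument requested in the first major comment can be illustrated on a toy case. The sketch below is not the paper's verification procedure; it assumes a hypothetical one-shot task in which an object sits in one of three containers, only the partner knows which, and the acting agent gets a single attempt. It enumerates every policy that ignores the partner's message and checks that none succeeds in all possible worlds, while a policy that acts on a truthful message does. All names (`LOCATIONS`, `blind_policies`, `informed_policy`) are invented for illustration.

```python
from typing import Callable, List, Optional

LOCATIONS: List[str] = ["drawer", "cabinet", "fridge"]  # hypothetical hiding places

Policy = Callable[[Optional[str]], str]


def solves_every_world(policy: Policy, message_for: Callable[[str], Optional[str]]) -> bool:
    """True if the actor opens the right container in every possible world."""
    return all(policy(message_for(world)) == world for world in LOCATIONS)


def blind_policies() -> List[Policy]:
    """All policies that ignore the partner's message: one constant choice each."""
    return [lambda _msg, choice=loc: choice for loc in LOCATIONS]


def informed_policy(message: Optional[str]) -> str:
    """Act on the partner's report; fall back to a fixed guess if no message arrives."""
    return message if message is not None else LOCATIONS[0]


no_message = lambda world: None          # partner withholds the location
truthful_message = lambda world: world   # partner reports the true location

blind_ok = any(solves_every_world(p, no_message) for p in blind_policies())
informed_ok = solves_every_world(informed_policy, truthful_message)
print(blind_ok, informed_ok)  # False True
```

In this toy setting, exhaustive enumeration is feasible because the message-ignoring policy class is tiny. A benchmark-scale version of the same argument would need to bound the class of non-epistemic policies explicitly, which is the gap the referee points to.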

minor comments (2)

[Evaluation] Define 'Pass^3' explicitly on first use, including the precise success criteria and any aggregation across agents or trials.
[Benchmark] Clarify how the hard split is constructed relative to the evolving task generation process and whether it remains fixed or is regenerated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and indicate the changes we will make in revision.

Referee: [Benchmark Construction] Benchmark Construction section (or equivalent): The formal verification procedure is described as confirming 'solvability and required epistemic depth,' yet it is not shown that every successful policy must perform epistemic coordination (e.g., via exhaustive search over information partitions or proof that non-epistemic policies fail). Without this necessity argument, the 0.0% Pass^3 result on the hard split could reflect failures in low-level planning, message parsing, or embodiment constraints instead of functional ToM, undermining the central claim that the benchmark isolates the intended capability.

Authors: We appreciate the referee's observation that a stronger necessity argument would more rigorously isolate functional ToM. Our verification procedure already enumerates information partitions to confirm that specific belief updates are required for solvability and that tasks demand a minimum epistemic depth; non-epistemic policies are ruled out at the verification stage because they lead to unsatisfiable goal conditions under the enumerated partitions. Nevertheless, we agree that making this explicit with additional examples and proof sketches would address potential alternative explanations such as low-level planning failures. We will expand the Benchmark Construction section accordingly in the revised manuscript. revision: yes
Referee: [Evaluation and Results] Evaluation and Results section: The 93% figure from manual failure analysis is post-hoc and based on a sampled subset; the paper should report the sample size, selection criteria, and inter-annotator agreement to establish that the attribution to epistemic coordination is representative rather than anecdotal.

Authors: We agree that greater transparency is needed for the manual failure analysis. We will revise the Evaluation and Results section to report the sample size of analyzed failures, the criteria used for selecting the subset (random sampling from the pool of failed trajectories), and inter-annotator agreement metrics. These additions will substantiate that the 93% attribution to epistemic coordination breakdowns is representative. revision: yes
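One way to report the inter-annotator agreement mentioned in this response is Cohen's kappa for two annotators. The rebuttal does not say which agreement metric will be used, so treat this as a minimal sketch under that assumption; the label values and the function name `cohens_kappa` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for four failed trajectories:
# "E" = epistemic coordination breakdown, "O" = other cause.
annotator_1 = ["E", "E", "O", "O"]
annotator_2 = ["E", "E", "O", "E"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Raw percent agreement would be 75% here, while kappa is 0.5 after correcting for chance. That correction matters when one category dominates, as it would if most failures are labeled epistemic.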

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fits

The paper introduces EnactToM as an empirical benchmark of 300 tasks with formal verification for solvability and epistemic depth, followed by direct model evaluations (0.0% Pass^3 functional vs 45.0% literal on hard split). No mathematical derivation chain, parameter fitting, predictions, or first-principles results are claimed or present. Task verification is a methodological assertion, not a self-referential definition or fitted input renamed as output. No self-citations are load-bearing for any central result, and the 93% failure analysis is post-hoc sampling. The work is self-contained against external benchmarks and model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is a benchmark construction paper rather than a derivation; it relies on domain assumptions about epistemic states and environment modeling but introduces no free parameters, new entities, or ad-hoc axioms beyond standard AI evaluation practices.

axioms (1)

domain assumption: Tasks can be formally verified for solvability and required epistemic depth.
Stated directly in the abstract as a property of the benchmark tasks.
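To make "required epistemic depth" concrete: the paper describes goal formulas that combine physical predicates with K-operators at K-depth 1, 2, or 3. The sketch below computes the nesting depth of K-operators in a formula. The nested-tuple encoding, the operator names, and the example predicates are assumptions made for illustration and do not reproduce the paper's PDDL syntax.

```python
from typing import Tuple, Union

# Assumed encoding: an atom is a string; a compound formula is a tuple.
#   ("K", agent, formula)        -> "agent knows formula"
#   ("and" | "or", f1, f2, ...)  -> boolean connective over sub-formulas
#   ("not", f)                   -> negation
Formula = Union[str, Tuple]


def k_depth(formula: Formula) -> int:
    """Maximum nesting depth of K-operators in the formula."""
    if isinstance(formula, str):
        return 0  # physical predicate, no knowledge operator
    op, *args = formula
    if op == "K":
        _agent, inner = args
        return 1 + k_depth(inner)
    # Connectives do not add depth; take the deepest sub-formula.
    return max((k_depth(arg) for arg in args), default=0)


# Hypothetical goal: the mug is on the table, and agent_a knows that
# agent_b knows the key is in the drawer.
goal = (
    "and",
    "on(mug, table)",
    ("K", "agent_a", ("K", "agent_b", "in(key, drawer)")),
)
print(k_depth(goal))  # 2
```

A depth check like this only measures the syntax of the goal. Whether a task truly requires reasoning at that depth is a separate question, which is the point the referee report raises about necessity.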

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: Each task is formally verified for solvability and required epistemic depth... K-depth 1/2/3... PDDL goal formula combining physical predicates with K-operators

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 1 internal anchor

[1]

Does the chimpanzee have a theory of mind?Behavioral and Brain Sciences, 1(4):515–526, 1978

David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind?Behavioral and Brain Sciences, 1(4):515–526, 1978

work page 1978
[2]

Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception.Cognition, 13(1):103–128, 1983

Heinz Wimmer and Josef Perner. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception.Cognition, 13(1):103–128, 1983. doi: 10.1016/0010-0277(83)90004-5

work page doi:10.1016/0010-0277(83)90004-5 1983
[3]

Understanding and sharing intentions: The origins of cultural cognition.Behavioral and Brain Sciences, 28(5):675–691, 2005

Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: The origins of cultural cognition.Behavioral and Brain Sciences, 28(5):675–691, 2005

work page 2005
[4]

Weisz, and Murray Campbell

Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, and Murray Campbell. Position: Theory of mind benchmarks are broken for large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of the 42nd Internationa...

work page 2025
[5]

Revisiting the evaluation of theory of mind through question answering.Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 5872–5877, 2019

Matthew Le, Y-Lan Boureau, and Maximilian Nickel. Revisiting the evaluation of theory of mind through question answering.Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 5872–5877, 2019. doi: 10.18653/v1/D19-1598

work page doi:10.18653/v1/d19-1598 2019
[6]

Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models

Yufan Wu, Yinghui He, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng. Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10691–10706, Singapore, December 2023. Associatio...

work page doi:10.18653/v1/2023.findings-emnlp.717 2023
[7]

Understanding social reasoning in language models with language models

Kanishk Gandhi, Jan-Philipp Fraenken, Tobias Gerstenberg, and Noah Goodman. Understanding social reasoning in language models with language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 13518–13529. Curran Associates, Inc., 2023. URLhttps://neurip...

work page 2023
[8]

Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. FANToM: A benchmark for stress-testing machine theory of mind in interactions.Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023. 11 EnactToM

work page 2023
[9]

Explore theory of mind: program-guided adversarial data generation for theory of mind reasoning

Melanie Sclar, Jane Dwivedi-Yu, Maryam Fazel-Zarandi, Yulia Tsvetkov, Yonatan Bisk, Yejin Choi, and Asli Celikyilmaz. Explore theory of mind: program-guided adversarial data generation for theory of mind reasoning. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations, volume 2025, pages 67635–67660, ...

work page 2025
[10]

OpenToM: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models

Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, and Yulan He. OpenToM: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Lun- Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the AssociationforComputationalLinguistics(Volume1: LongPapers), pages8593–8623,...

work page doi:10.18653/v1/2024.acl-long.466 2024
[11]

MMToM-QA: Multimodal theory of mind question answering

Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua Tenenbaum, and Tianmin Shu. MMToM-QA: Multimodal theory of mind question answering. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: ...

work page doi:10.18653/v1/2024.acl-long.851 2024
[12]

Muma- tom: Multi-modalmulti-agenttheoryofmind.ProceedingsoftheAAAIConferenceonArtificialIntelligence, 39(2):1510–1519, Apr

Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, and Tianmin Shu. Muma- tom: Multi-modalmulti-agenttheoryofmind.ProceedingsoftheAAAIConferenceonArtificialIntelligence, 39(2):1510–1519, Apr. 2025. doi: 10.1609/aaai.v39i2.32142. URLhttps://ojs.aaai.org/ind ex.php/AAAI/article/view/32142

work page doi:10.1609/aaai.v39i2.32142 2025
[13]

McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, and Manaal Faruqui

Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, and Manaal Faruqui. How far are large language models from agents with theory-of-mind?, 2023. URLhttps://arxiv. org/abs/2310.03051

work page arXiv 2023
[14]

Simpletom: Exposing the gap between explicit tom inference and implicit tom application in llms,

Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras, Peter Clark, and Yejin Choi. Simpletom: Exposing the gap between explicit tom inference and implicit tom application in llms,

work page
[15]

URLhttps://arxiv.org/abs/2410.13648

work page arXiv
[16]

Habitat 2.0: Training home assistants to rearrange their habitat

Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. InAdvances in Neural Information Processing Systems, volume 34, pages 251–266, 2021

work page 2021
[17]

PARTNR: A benchmark for planning and reasoning in embodied multi-agent tasks

Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John M Turner, Eric Undersander, and Tsung-Yen Yang. PARTNR: A benchmark for planning a...

work page 2025
[18]

TEACh: Task-driven embodied agents that chat.Proceedings of the AAAI Conference on Artificial Intelligence, 36(2):2017– 2025, June 2022

Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-driven embodied agents that chat.Proceedings of the AAAI Conference on Artificial Intelligence, 36(2):2017– 2025, June 2022. doi: 10.1609/aaai.v36i2.20097. 12 EnactToM

work page doi:10.1609/aaai.v36i2.20097 2017
[19]

Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H

Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling. The Hanabi challenge: A new frontier for AI research.Artificial Intelligence, 280:103216, 2020. ISSN 0004-3702. d...

work page doi:10.1016/j.artint.2019.1032 2020
[20]

SOTOPIA: Interactive evaluation for social intelligence in language agents

Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. SOTOPIA: Interactive evaluation for social intelligence in language agents. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=mM7VurbA4r

work page 2024
[21]

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo...

work page arXiv 2026
[22]

Mapping global dynamics of benchmark creation and saturation in artificial intelligence.Nature Communications, 13(1):6793, November 2022

Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence.Nature Communications, 13(1):6793, November 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-34591-0

work page doi:10.1038/s41467-022-34591-0 2022
[23]

Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks

Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5075–5084, Singapore, Decem...

work page doi:10.18653/v1/2023.emnlp-main.308 2023
[24]

Time travel in LLMs: Tracing data contamination in large language models

Shahriar Golchin and Mihai Surdeanu. Time travel in LLMs: Tracing data contamination in large language models. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2Rwq6c3tvr

work page 2024
[25]

Dynabench: Rethinking benchmarking in nlp

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. In Kristina Toutanova, A...

work page doi:10.18653/v1/2021.naacl-main.324 2021
[26]

Livebench: A challenging, contamination-limited LLM benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-limited LLM benchmark. In...

work page 2025
[27]

Experimental evidence on players’ models of other players.Journal of Economic Behavior & Organization, 25(3):309–327, 1994

Dale O Stahl and Paul W Wilson. Experimental evidence on players’ models of other players.Journal of Economic Behavior & Organization, 25(3):309–327, 1994

work page 1994
[28]

A cognitive hierarchy model of games.The Quarterly Journal of Economics, 119(3):861–898, 2004

Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games.The Quarterly Journal of Economics, 119(3):861–898, 2004

work page 2004
[29]

emnlp-main.394/

John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 50528–50652...

work page doi:10.52202/079017-1601 2024
[30]

doi: 10.1073/pnas.2405460121

Michal Kosinski. Evaluating large language models in theory of mind tasks.Proceedings of the National Academy of Sciences, 121(45):e2405460121, 2024. doi: 10.1073/pnas.2405460121. URL https://www.pnas.org/doi/abs/10.1073/pnas.2405460121

work page doi:10.1073/pnas.2405460121 2024
[31]

Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks, 2023

Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks, 2023. URL https://arxiv.org/abs/2302.08399

work page arXiv 2023
[32]

Clever hans or neural theory of mind? stress testing social reasoning in large language models

Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. Clever hans or neural theory of mind? stress testing social reasoning in large language models. In Yvette Graham and Matthew Purver, editors,Proceedings of the 18th Conference of the European Chapter of the Association for Computational ...

work page doi:10.18653/v1/2024.eacl-long.138 2024
[33]

Neural theory-of-mind? on the limits of social intelligence in large LMs

Maarten Sap, Ronan Le Bras, Daniel Fried, and Yejin Choi. Neural theory-of-mind? on the limits of social intelligence in large LMs. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3762–3780, Abu Dhabi, United Arab Emirates, December 2022. Association ...

work page 2022
[34]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, KianaEhsani,DanielGordon,YukeZhu,AniruddhaKembhavi,AbhinavGupta,andAliFarhadi. Ai2-thor: An interactive 3d environment for visual ai, 2017. URLhttps://arxiv.org/abs/1712.05474

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

VirtualHome: SimulatingHouseholdActivitiesViaPrograms

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: SimulatingHouseholdActivitiesViaPrograms. In2018IEEE/CVFConferenceonComputer Vision and Pattern Recognition (CVPR), pages 8494–8502, Los Alamitos, CA, USA, June 2018. IEEE Computer Society. doi: 10.1109/CVPR.2018.00886. URLhttps://doi.ieeecomputers...

work page doi:10.1109/cvpr.2018.00886 2018
[36]

Alfred: A benchmark for interpreting grounded instructions for everyday tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 14 EnactToM

work page 2020
[37]

Ho, Thomas L

Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-ai coordination, 2020. URL https://arxiv.org/abs/1910.05789

work page arXiv 2020
[38]

O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701320. doi:...

work page doi:10.1145/3586183.3606763 2023
[39]

Masangkay, Kathleen A

Zenaida S. Masangkay, Kathleen A. McCluskey, Curtis W. McIntyre, Judith Sims-Knight, Brian E. Vaughn, and John H. Flavell. The early development of inferences about the visual percepts of others. Child Development, 45(2):357–366, 1974. ISSN 00093920, 14678624. doi: 10.2307/1127956. URL http://www.jstor.org/stable/1127956

work page doi:10.2307/1127956 1974
[40]

John H. Flavell. The development of knowledge about visual perception. InNebraska Symposium on Motivation, volume 25, pages 43–76. University of Nebraska Press, Lincoln, NE, 1977

work page 1977
[41]

Flavell, Barbara A

John H. Flavell, Barbara A. Everett, Karen Croft, and Eleanor R. Flavell. Young children’s knowledge about visual perception: Further evidence for the Level 1–Level 2 distinction.Developmental Psychology, 17(1):99–103, 1981. doi: 10.1037/0012-1649.17.1.99. URLhttps://doi.org/10.1037/0012-1 649.17.1.99

work page doi:10.1037/0012-1649.17.1.99 1981
[42]

John H. Flavell. Perspectives on perspective taking. In Harry Beilin and Peter B. Pufall, editors,Piaget’s Theory: Prospects and Possibilities, pages 107–139. Lawrence Erlbaum Associates, Hillsdale, NJ, 1992

work page 1992
[43] John H. Flavell, Susan G. Shipstead, and Karen Croft. Young children’s knowledge about visual perception: Hiding objects from others. Child Development, 49(4):1208–1211, 1978. ISSN 00093920, 14678624. URL http://www.jstor.org/stable/1128761.
[44] Simon Baron-Cohen, Alan M. Leslie, and Uta Frith. Does the autistic child have a “theory of mind”? Cognition, 21(1):37–46, 1985. ISSN 0010-0277. doi: 10.1016/0010-0277(85)90022-8.
[45] Henry M. Wellman, David Cross, and Julanne Watson. Meta-analysis of theory-of-mind development: The truth about false belief. Child Development, 72(3):655–684, May 2001. ISSN 0009-3920. doi: 10.1111/1467-8624.00304. URL https://doi.org/10.1111/1467-8624.00304.
[46] Henry M. Wellman and David Liu. Scaling of theory-of-mind tasks. Child Development, 75(2):523–541, March 2004. ISSN 0009-3920. doi: 10.1111/j.1467-8624.2004.00691.x.
[47] Josef Perner and Heinz Wimmer. “John thinks that Mary thinks that...” attribution of second-order beliefs by 5- to 10-year-old children. Journal of Experimental Child Psychology, 39(3):437–471, 1985. ISSN 0022-0965. doi: 10.1016/0022-0965(85)90051-7.
[48] Ian A. Apperly and Stephen A. Butterfill. Do humans have two systems to track beliefs and belief-like states? Psychological Review, 116(4):953–970, 2009. doi: 10.1037/a0016923. URL https://doi.org/10.1037/a0016923.
[49] Kristine H. Onishi and Renée Baillargeon. Do 15-month-old infants understand false beliefs? Science, 308(5719):255–258, 2005. doi: 10.1126/science.1107621. URL https://www.science.org/doi/abs/10.1126/science.1107621.
[50] Boaz Keysar, Shuhong Lin, and Dale J. Barr. Limits on theory of mind use in adults. Cognition, 89(1):25–41, 2003. ISSN 0010-0277. doi: 10.1016/S0010-0277(03)00064-7.
[51] Daniel C. Dennett. Beliefs about beliefs. Behavioral and Brain Sciences, 1(4):568–570, 1978. doi: 10.1017/S0140525X00076664.
[52] Daniel C. Dennett. The Intentional Stance. MIT Press, Cambridge, MA, 1987. URL https://mitpress.mit.edu/9780262540537/the-intentional-stance/.
[53] Chris L. Baker, Rebecca Saxe, and Joshua B. Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009. ISSN 0010-0277. doi: 10.1016/j.cognition.2009.07.005.
[54] Chris L. Baker, Julian Jara-Ettinger, Rebecca Saxe, and Joshua B. Tenenbaum. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1:0064. doi: 10.1038/s41562-017-0064. URL https://doi.org/10.1038/s41562-017-0064.
[56] Michael E. Bratman. Shared cooperative activity. The Philosophical Review, 101(2):327–341, 1992.
[57] Cordula Vesper, Stephen Butterfill, Günther Knoblich, and Natalie Sebanz. A minimal architecture for joint action. Neural Networks, 23(8):998–1003, 2010. ISSN 0893-6080. doi: 10.1016/j.neunet.2010.06.002.
[58] Robert J. Aumann. Agreeing to disagree. The Annals of Statistics, 4(6):1236–1239, 1976. ISSN 00905364, 21688966. URL http://www.jstor.org/stable/2958591.
[59] Robert J. Aumann. Interactive epistemology II: Probability. International Journal of Game Theory, 28(3):301–314, 1999. doi: 10.1007/s001820050112. URL https://doi.org/10.1007/s001820050112.
[60] Dale O. Stahl and Paul W. Wilson. On players' models of other players: Theory and experimental evidence. Games and Economic Behavior, 10(1):218–254, 1995. ISSN 0899-8256. doi: 10.1006/game.1995.1031.
[61] Rosemarie Nagel. Unraveling in guessing games: An experimental study. The American Economic Review, 85(5):1313–1326, 1995. ISSN 00028282. URL http://www.jstor.org/stable/2950991.
[62] Vincent P. Crawford and Nagore Iriberri. Fatal attraction: Salience, naïveté, and sophistication in experimental “hide-and-seek” games. American Economic Review, 97(5):1731–1750, 2007. doi: 10.1257/aer.97.5.1731. URL https://doi.org/10.1257/aer.97.5.1731.

Figure 3: Cumulative tasks generated with and without ICL seed examples. With ICL (seed task...
[63] Another agent opens cabinet_34. If Fast Downward finds this plan (or any valid alternative), the task is provably solvable. The K-depth of 2 is read directly from the nesting structure during Step 1. F. Task generation agent workspace and prompt. The generation agent operates in an isolated workspace directory: workspace/ working_task.json # task being aut...
[64] new_scene[N] → load scene
[65] Inspect seed tasks in sampled_tasks/ for inspiration
[66] Edit working_task.json: author the problem_pddl :goal FIRST, then write task, agent_secrets, and mechanic bindings to match. Do NOT hand-author :objects or :init
[67] judge[] → fix → repeat until pass
[68] test_task[] → reject tasks that fail with full information
[69] submit_task[]. Core rules. – Author the PDDL goal as the source of truth; write narrative to match it. – Secrets state WHAT (room restrictions, target IDs, mechanic hints) but NEVER HOW (no coordination strategy, no relay instructions). – Every agent must make a distinct, non-substitutable contribution. – At least one physical action must be information-d...
[70] argued that the critical test of ToM is attribution of false beliefs, and [51] formalised the “intentional stance”: predicting behaviour by attributing beliefs and rational agency. [52] computationally formalised this as Bayesian inverse planning, later extended to joint inference over beliefs, desires, and percepts [53]. On the coordination side, [54] a...

[1] David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4):515–526, 1978.

[2] Heinz Wimmer and Josef Perner. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 13(1):103–128, 1983. doi: 10.1016/0010-0277(83)90004-5.

[3] Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(5):675–691, 2005.

[4] Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, and Murray Campbell. Position: Theory of mind benchmarks are broken for large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd Internationa...

[5] Matthew Le, Y-Lan Boureau, and Maximilian Nickel. Revisiting the evaluation of theory of mind through question answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 5872–5877, 2019. doi: 10.18653/v1/D19-1598.

[6] Yufan Wu, Yinghui He, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng. Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10691–10706, Singapore, December 2023. Associatio... doi: 10.18653/v1/2023.findings-emnlp.717.

[7] Kanishk Gandhi, Jan-Philipp Fraenken, Tobias Gerstenberg, and Noah Goodman. Understanding social reasoning in language models with language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 13518–13529. Curran Associates, Inc., 2023. URL https://neurip...

[8] Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. FANToM: A benchmark for stress-testing machine theory of mind in interactions. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023.

[9] Melanie Sclar, Jane Dwivedi-Yu, Maryam Fazel-Zarandi, Yulia Tsvetkov, Yonatan Bisk, Yejin Choi, and Asli Celikyilmaz. Explore theory of mind: program-guided adversarial data generation for theory of mind reasoning. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 67635–67660, ...

[10] Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, and Yulan He. OpenToM: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8593–8623,... doi: 10.18653/v1/2024.acl-long.466.

[11] Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua Tenenbaum, and Tianmin Shu. MMToM-QA: Multimodal theory of mind question answering. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: ... doi: 10.18653/v1/2024.acl-long.851.

[12] Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, and Tianmin Shu. Muma-tom: Multi-modal multi-agent theory of mind. Proceedings of the AAAI Conference on Artificial Intelligence, 39(2):1510–1519, Apr. 2025. doi: 10.1609/aaai.v39i2.32142. URL https://ojs.aaai.org/index.php/AAAI/article/view/32142.

[13] Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, and Manaal Faruqui. How far are large language models from agents with theory-of-mind?, 2023. URL https://arxiv.org/abs/2310.03051.

[14] Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras, Peter Clark, and Yejin Choi. Simpletom: Exposing the gap between explicit tom inference and implicit tom application in llms. URL https://arxiv.org/abs/2410.13648.

[16] Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems, volume 34, pages 251–266, 2021.

[17] Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John M. Turner, Eric Undersander, and Tsung-Yen Yang. PARTNR: A benchmark for planning a...

[18] Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-driven embodied agents that chat. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2):2017–2025, June 2022. doi: 10.1609/aaai.v36i2.20097.

[19] Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020. ISSN 0004-3702. d...

[20] Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. SOTOPIA: Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mM7VurbA4r.

[21] Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo...

[22] Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 13(1):6793, November 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-34591-0.

[23] Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5075–5084, Singapore, Decem... doi: 10.18653/v1/2023.emnlp-main.308.

[24] Shahriar Golchin and Mihai Surdeanu. Time travel in LLMs: Tracing data contamination in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2Rwq6c3tvr.

[25] Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. In Kristina Toutanova, A... doi: 10.18653/v1/2021.naacl-main.324.

[26] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-limited LLM benchmark. In...

[27] Dale O. Stahl and Paul W. Wilson. Experimental evidence on players’ models of other players. Journal of Economic Behavior & Organization, 25(3):309–327, 1994.

[28] Colin F. Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861–898, 2004.

[29] John Yang, Carlos Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 50528–50652... doi: 10.52202/079017-1601.

[30] Michal Kosinski. Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121(45):e2405460121, 2024. doi: 10.1073/pnas.2405460121. URL https://www.pnas.org/doi/abs/10.1073/pnas.2405460121.

[31] Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks, 2023. URL https://arxiv.org/abs/2302.08399.

[32] Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. Clever hans or neural theory of mind? stress testing social reasoning in large language models. In Yvette Graham and Matthew Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational ... doi: 10.18653/v1/2024.eacl-long.138.

[33] Maarten Sap, Ronan Le Bras, Daniel Fried, and Yejin Choi. Neural theory-of-mind? on the limits of social intelligence in large LMs. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3762–3780, Abu Dhabi, United Arab Emirates, December 2022. Association ...

[34] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2017. URL https://arxiv.org/abs/1712.05474.

[35] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: Simulating Household Activities Via Programs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8494–8502, Los Alamitos, CA, USA, June 2018. IEEE Computer Society. doi: 10.1109/CVPR.2018.00886. URL https://doi.ieeecomputers...

[36] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
