Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration

Fardin Saad; Munindar P. Singh; Pradeep K. Murukannaiah

arxiv: 2507.02935 · v3 · submitted 2025-06-26 · 💻 cs.CL · cs.AI · cs.MA

Pith reviewed 2026-05-19 07:20 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA

keywords theory of mind · instruction inference · human-agent collaboration · large language models · few-shot prompting · chain-of-thought reasoning · intent accuracy

The pith

Tomcat LLM agent interprets ambiguous instructions at human-comparable levels in goal-oriented collaboration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Instruction Inference task to test whether agents can infer unspoken intentions from incomplete or ambiguous instructions in a dynamic setting where a human principal and an agent pursue a shared goal. It presents Tomcat, an LLM-based agent with two variants: Fs-CoT, which uses few-shot chain-of-thought examples for structured reasoning, and CP, which relies on commonsense prompting. With the Fs-CoT variant on GPT-4o and DeepSeek-R1, Tomcat performs comparably to 52 human participants on intent accuracy, action optimality, and planning optimality. A sympathetic reader would care because this suggests agents can handle the natural vagueness of human instructions without requiring exhaustive explicit detail, enabling smoother human-agent teams.

Core claim

Tomcat, an LLM-based agent designed to exhibit Theory of Mind reasoning, achieves performance comparable to human participants on intent accuracy, action optimality, and planning optimality when using few-shot chain-of-thought prompting with GPT-4o and DeepSeek-R1 in the Instruction Inference task.

What carries the argument

Tomcat, an LLM-based agent that applies either few-shot chain-of-thought examples or commonsense knowledge to infer a principal's mental states and intentions from incomplete instructions.
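As a rough illustration of how the two variants differ, the sketch below assembles a prompt with or without worked reasoning examples. It is a minimal sketch under stated assumptions: `Exemplar`, `build_prompt`, and every field name are hypothetical, and the strings are placeholders rather than the paper's actual prompts, which are longer and contain components not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exemplar:
    """One worked demonstration (hypothetical structure, not the paper's format)."""
    instruction: str
    reasoning: str
    response: str


def build_prompt(background: str, grid: str, instruction: str,
                 exemplars: Optional[List[Exemplar]] = None) -> str:
    """Assemble a prompt string.

    With no exemplars the prompt carries only background and problem
    information (loosely CP-like); with exemplars it also carries worked
    reasoning chains (loosely Fs-CoT-like). Both readings are assumptions.
    """
    parts = [background, "Grid:\n" + grid]
    for i, ex in enumerate(exemplars or [], start=1):
        parts.append(
            f"Example {i}\nInstruction: {ex.instruction}\n"
            f"Reasoning: {ex.reasoning}\nResponse: {ex.response}"
        )
    # The trailing cue asks the model to reason before it answers.
    parts.append(f"Instruction: {instruction}\nReasoning:")
    return "\n\n".join(parts)


cp_prompt = build_prompt("You assist a human in a gridworld.", "<grid>",
                         "Can you pass me the red key?")
fs_prompt = build_prompt("You assist a human in a gridworld.", "<grid>",
                         "Can you pass me the red key?",
                         [Exemplar("<instruction>", "<reasoning>", "<response>")])
```

The only difference between `cp_prompt` and `fs_prompt` is the block of worked examples, which is the asymmetry the editorial analysis later questions.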

If this is right

Agents can collaborate effectively with humans even when instructions are incomplete or ambiguous.
LLMs configured with few-shot chain-of-thought can exhibit Theory of Mind capabilities in goal-oriented collaborative environments.
The few-shot variant outperforms the commonsense-prompt variant in reaching human-comparable levels across the three metrics.
Such agents reduce the need for perfectly explicit instructions in human-agent teaming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This capability could transfer to physical robots or virtual assistants operating in real-time shared workspaces.
Extending the task to longer sequences or multi-turn interactions would test whether the performance holds under greater uncertainty.
Widespread adoption might lower communication costs in teams by allowing agents to fill in reasonable gaps from context.

Load-bearing premise

The 52 human participants who received the same information as the commonsense-prompt variant provide a valid and representative baseline for human Theory of Mind performance in this task.

What would settle it

A replication experiment in which Tomcat with few-shot chain-of-thought on GPT-4o scores substantially below the human baseline on intent accuracy would falsify the comparability result.

Figures

Figures reproduced from arXiv: 2507.02935 by Fardin Saad, Munindar P. Singh, Pradeep K. Murukannaiah.

**Figure 1.** Pipeline of the Instruction Inference Task with Tomcat and participants. Perrault’s [1979] plan-based theory suggests that responses to indirect directives can be modeled through inference processes that consider speaker intentions. We investigate Tomcat’s ToM capabilities on Instruction Inference using GPT-4o, DeepSeek-R1, and Gemma-3-27B LLMs. Our study employs two variants of Tomcat, Commonsense Prompt …

**Figure 2.** Example scenario in the Doors, Keys, and Gems world, showing an agent infer …

**Figure 3.** Excerpt of the Response Generation component.

**Figure 2.** Initial Grid Configuration … Initial Grid: This grid in …

**Figure 5.** Additional rules of Common Ground for Tomcat with Fs-CoT.

**Figure 6.** Binary metrics with standard error of mean (SEM). Participants were not tasked …

**Figure 7.** Box plots of continuous metrics with the IQR and median collapsed into a single …

**Figure 1.** Grid Configuration [[‘.’ ‘.’ ‘.’ ‘y’ ‘.’ ‘.’ ‘.’ ‘b’ ‘W’ ‘W’ ‘W’ ‘W’] [‘r’ ‘W’ ‘W’ ‘r’ ‘W’ ‘W’ ‘.’ ‘r’ ‘W’ ‘W’ ‘W’ ‘g’] [‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘m’ ‘W’ ‘W’ ‘W’ ‘W’ ‘R’] [‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’] [‘g’ ‘.’ ‘.’ ‘.’ ‘B’ ‘.’ ‘.’ ‘R’ ‘.’ ‘.’ ‘.’ ‘.’] [‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘W’] [‘.’ ‘.’ ‘.’ ‘Y’ ‘.’ ‘.’ ‘.’ ‘W’ ‘W’ ‘W…

**Figure 2.** Initial Grid Configuration [[‘r’ ‘.’ ‘.’ ‘.’ ‘m’ ‘W’ ‘W’ ‘g’] [‘y’ ‘.’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘.’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘.’] [‘.’ ‘R’ ‘.’ ‘.’ ‘.’ ‘.’ ‘h’ ‘.’] [‘.’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’] [‘.’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘Y’] [‘Y’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’] [‘g’ ‘W’ ‘g’ ‘W’ ‘W’ ‘W’ ‘W’ ‘g’]] Object Positions for …

**Figure 3.** Observed Grid Configuration [[‘r’ ‘.’ ‘.’ ‘.’ ‘m’ ‘W’ ‘W’ ‘g’] [‘y’ ‘.’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘.’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘.’] [‘.’ ‘R’ ‘h’ ‘.’ ‘.’ ‘.’ ‘.’ ‘.’] [‘.’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’] [‘.’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘Y’] [‘Y’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’] [‘g’ ‘W’ ‘g’ ‘W’ ‘W’ ‘W’ ‘W’ ‘g’]] Object Positions for …

**Figure 4.** Completed Grid Configuration [[‘r’ ‘.’ ‘.’ ‘.’ ‘m’ ‘W’ ‘W’ ‘g’] [‘y’ ‘.’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘.’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘.’] [‘.’ ‘.’ ‘.’ ‘m’ ‘.’ ‘.’ ‘.’ ‘.’] [‘.’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’] [‘.’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘Y’] [‘.’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’ ‘.’] [‘h’ ‘W’ ‘g’ ‘W’ ‘W’ ‘W’ ‘W’ ‘g’]] Object Positions for …

**Figure 5.** Initial Grid Configuration [[‘W’ ‘W’ ‘b’ ‘W’ ‘W’ ‘W’ ‘r’ ‘W’ ‘W’] [‘W’ ‘r’ ‘.’ ‘r’ ‘W’ ‘b’ ‘.’ ‘b’ ‘W’] [‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’] [‘W’ ‘W’ ‘.’ ‘.’ ‘m’ ‘.’ ‘.’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘g’] [‘h’ ‘.’ ‘.’ ‘.’ ‘.’ ‘.’ ‘B’ ‘B’ ‘.’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘g’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘R’ ‘W’ ‘W’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘R’ ‘W’ ‘W’ ‘W’ ‘W’] […

**Figure 6.** Observed Grid Configuration [[‘W’ ‘W’ ‘b’ ‘W’ ‘W’ ‘W’ ‘r’ ‘W’ ‘W’] [‘W’ ‘r’ ‘.’ ‘r’ ‘W’ ‘b’ ‘.’ ‘b’ ‘W’] [‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’] [‘W’ ‘W’ ‘.’ ‘.’ ‘m’ ‘.’ ‘.’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘g’] [‘.’ ‘.’ ‘.’ ‘.’ ‘h’ ‘.’ ‘B’ ‘B’ ‘.’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘g’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘R’ ‘W’ ‘W’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘R’ ‘W’ ‘W’ ‘W’ ‘W’] …

**Figure 7.** Completed Grid Configuration [[‘W’ ‘W’ ‘b’ ‘W’ ‘W’ ‘W’ ‘r’ ‘W’ ‘W’] [‘W’ ‘.’ ‘.’ ‘,’ ‘W’ ‘b’ ‘.’ ‘b’ ‘W’] [‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’] [‘W’ ‘W’ ‘.’ ‘.’ ‘.’ ‘.’ ‘.’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘m’ ‘W’ ‘W’ ‘W’ ‘g’] [‘.’ ‘.’ ‘.’ ‘.’ ‘,’ ‘.’ ‘B’ ‘B’ ‘.’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘g’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’] [‘W’ ‘W’ ‘W’ ‘W’ ‘.’ ‘W’ ‘W’ ‘W’ ‘W’]…
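The grid captions above are easier to read once loaded into a data structure. The following is a small sketch that copies the complete 8x8 "Initial Grid Configuration" and lists where each non-empty symbol sits, using (row, column) indexing from the top-left corner. The symbol reading in the comments (lowercase letters for keys, uppercase for doors, `g` for gems, `h` for the human, `m` for the agent) is an assumption, not something the captions state, and `object_positions` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Copied from the 8x8 "Initial Grid Configuration" caption.
# Assumed legend: 'W' wall, '.' empty cell, 'r'/'y' keys, 'R'/'Y' doors,
# 'g' gem, 'h' human, 'm' agent.
GRID = [
    ["r", ".", ".", ".", "m", "W", "W", "g"],
    ["y", ".", "W", "W", ".", "W", "W", "."],
    ["W", "W", "W", "W", ".", "W", "W", "."],
    [".", "R", ".", ".", ".", ".", "h", "."],
    [".", "W", ".", "W", "W", "W", "W", "."],
    [".", "W", ".", "W", "W", "W", "W", "Y"],
    ["Y", "W", ".", "W", "W", "W", "W", "."],
    ["g", "W", "g", "W", "W", "W", "W", "g"],
]


def object_positions(grid: List[List[str]]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each symbol other than wall/empty to its (row, col) cells, in row-major order."""
    found: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell not in (".", "W"):
                found[cell].append((r, c))
    return dict(found)


positions = object_positions(GRID)
```

Under this reading, `positions["r"]` gives the red key at (0, 0) and `positions["h"]` gives the human at (3, 6) in the initial grid.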

Original abstract

Successful human-agent teaming relies on an agent being able to understand instructions given by a (human) principal. In many cases, an instruction may be incomplete or ambiguous. In such cases, the agent must infer the unspoken intentions from their shared context, that is, it must exercise the principal's Theory of Mind (ToM) and infer the mental states of its principal. We consider the prospects of effective human-agent collaboration using large language models (LLMs). To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting incomplete or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implemented two variants of Tomcat. One, dubbed Fs-CoT (Fs for few-shot, CoT for chain-of-thought), is based on a small number of examples demonstrating the requisite structured reasoning. One, dubbed CP (commonsense prompt), relies on commonsense knowledge and information about the problem. We realized both variants of Tomcat on three leading LLMs, namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-agent collaboration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New task for collaborative ToM is a solid step but the human comparison is undercut by unequal information given to models versus people.

The key takeaway is that this work defines a practical new benchmark for Theory of Mind in ongoing human-agent teamwork and reports that advanced prompting lets some LLMs reach human levels on it. The Instruction Inference task looks like a genuine addition. It puts the agent in a setting where it must interpret vague instructions to help a principal achieve a goal, using shared context to infer intentions. The authors built Tomcat with two variants: one using few-shot examples plus chain-of-thought, and another relying on commonsense prompts. Testing these on GPT-4o, DeepSeek-R1, and Gemma-3-27B, then pitting them against 52 humans who saw the same base information, gives a direct comparison on intent accuracy, action optimality, and planning optimality. The result that the Fs-CoT versions come close to humans is the main empirical point. This setup moves beyond static ToM puzzles by making the interaction dynamic and goal-directed. The authors deserve credit for running the human study and for measuring multiple aspects of performance rather than a single accuracy metric.

Where it gets shaky is the fairness of that human baseline. The people in the study received only the commonsense prompt information, while the Fs-CoT model got additional few-shot demonstrations and an explicit reasoning structure. If the scores end up similar, it could be because the extra guidance helped the model catch up rather than because it has equivalent ToM skills. Without details on participant background, time allowed, or how optimality was judged, it is hard to know how solid the numbers are. The lack of error bars or significance tests also makes the comparability claim harder to evaluate.

Readers working on agent design for real collaboration, or on better ways to elicit reasoning from LLMs, would get the most from this. It is not a complete overhaul of the field, but it offers a usable task that others could build on. I think it merits peer review so that reviewers can push on the baseline setup and ask for more statistics and controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Instruction Inference task to evaluate Theory of Mind (ToM) capabilities in dynamic, goal-oriented human-agent collaboration. It presents Tomcat, an LLM-based agent with two variants—Fs-CoT (few-shot examples plus structured chain-of-thought) and CP (commonsense prompt only)—implemented on GPT-4o, DeepSeek-R1, and Gemma-3-27B. Performance is measured via intent accuracy, action optimality, and planning optimality in a human study with 52 participants who received information equivalent to the CP variant. The central finding is that Tomcat Fs-CoT, particularly with GPT-4o and DeepSeek-R1, reaches performance levels comparable to the human participants.

Significance. If the empirical comparison holds under controlled conditions, the work offers a concrete, dynamic benchmark for ToM in collaborative settings and provides initial evidence that advanced LLMs can perform instruction inference at human-comparable levels. The task design and dual-variant agent architecture are useful contributions for future human-agent teaming research.

major comments (2)

[Abstract and Human Study] The central comparability claim (Fs-CoT Tomcat matching human performance on the three metrics) rests on the human baseline, yet participants received only the CP information while Fs-CoT supplies additional few-shot demonstrations and explicit chain-of-thought scaffolding. This asymmetry is load-bearing for interpreting whether comparable scores reflect ToM inference or the extra prompting structure; the manuscript does not address or control for it.
[Evaluation and Results] No error bars, statistical significance tests, variance measures, or details on participant selection, task timing, scoring rubrics for action/plan optimality, or prior exposure controls are reported for the 52-participant study. These omissions directly affect confidence in the cross-condition and cross-metric comparability statements.

minor comments (1)

[Metrics] Notation for the three optimality metrics could be defined more explicitly (e.g., formal definitions or pseudocode) to aid reproducibility.
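To make the request concrete, here is one hypothetical formalization of the kind of definition the referee asks for. The paper's own definitions are not reproduced on this page, so everything below is an assumption: `intent_accuracy` treats intent as a label that either matches the ground truth or does not, and `cost_ratio` is a generic optimal-over-actual ratio that could stand in for an action-level or plan-level optimality score.

```python
from typing import Sequence


def intent_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of trials whose predicted intent label equals the gold label.

    Hypothetical definition; the paper may score intent differently.
    """
    if len(predicted) != len(gold) or len(gold) == 0:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def cost_ratio(optimal_cost: int, actual_cost: int) -> float:
    """Optimal cost divided by actual cost, capped at 1.0.

    A value of 1.0 means the response was as cheap as the best known one.
    Hypothetical stand-in for an optimality metric.
    """
    if optimal_cost <= 0 or actual_cost <= 0:
        raise ValueError("costs must be positive")
    return min(1.0, optimal_cost / actual_cost)


# Placeholder inputs for illustration only, not data from the paper.
acc = intent_accuracy(["red key", "blue key"], ["red key", "yellow key"])
ratio = cost_ratio(optimal_cost=6, actual_cost=8)
```

Writing the metrics down at this level of detail would also settle whether each one is binary or continuous per trial, which matters for how error bars are computed.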

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate planned revisions to improve the clarity and rigor of the manuscript.

Referee: [Abstract and Human Study] The central comparability claim (Fs-CoT Tomcat matching human performance on the three metrics) rests on the human baseline, yet participants received only the CP information while Fs-CoT supplies additional few-shot demonstrations and explicit chain-of-thought scaffolding. This asymmetry is load-bearing for interpreting whether comparable scores reflect ToM inference or the extra prompting structure; the manuscript does not address or control for it.

Authors: We agree that the manuscript should explicitly discuss the information asymmetry between the human participants (who received only CP-equivalent information) and the Fs-CoT agent variant. The human study was designed to provide a direct baseline for commonsense-based instruction inference without additional scaffolding, allowing comparison to the minimal CP agent. The Fs-CoT results are presented to illustrate the upper potential of LLMs under structured prompting. We will revise the abstract, introduction, and results sections to clarify this design choice, discuss its implications for interpreting ToM performance, and note that future work could include human conditions with few-shot examples to further control for the asymmetry.
Revision: partial

Referee: [Evaluation and Results] No error bars, statistical significance tests, variance measures, or details on participant selection, task timing, scoring rubrics for action/plan optimality, or prior exposure controls are reported for the 52-participant study. These omissions directly affect confidence in the cross-condition and cross-metric comparability statements.

Authors: We acknowledge that the current manuscript lacks these statistical and methodological details. We will add error bars and variance measures to all reported metrics, include appropriate statistical significance tests for cross-condition comparisons, and expand the methods section with full details on participant selection criteria, task timing, scoring rubrics for action and planning optimality, and controls for prior exposure or familiarity with similar tasks.
Revision: yes
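For a binary metric, the error bars and significance test discussed in this exchange can be sketched in a few lines. This is a minimal illustration, assuming independent trials and large enough samples for the normal approximation; the function names are hypothetical and the counts are placeholders, not results from the paper.

```python
import math
from typing import Tuple


def proportion_sem(successes: int, n: int) -> Tuple[float, float]:
    """Return (proportion, standard error of the mean) for a binary metric."""
    p = successes / n
    return p, math.sqrt(p * (1 - p) / n)


def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> float:
    """z statistic for the difference between two independent proportions,
    using the pooled proportion under the null hypothesis of no difference."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (s1 / n1 - s2 / n2) / se


# Placeholder counts for illustration only (not data from the study).
p_a, sem_a = proportion_sem(30, 40)
z = two_proportion_z(30, 40, 24, 40)
```

If each participant answers several items, responses from the same person are not independent, so a per-participant aggregate or a mixed-effects model would be the safer choice than this pooled test.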

Circularity Check

0 steps flagged

No circularity in empirical comparison to human baseline

The paper introduces the Instruction Inference task and evaluates Tomcat (Fs-CoT and CP variants) on three LLMs against 52 human participants using direct metrics of intent accuracy, action optimality, and planning optimality. The central claim of comparable performance rests on experimental measurement of outputs against independently collected human data rather than any derivation, fitted parameter, or self-referential reduction. No equations or predictions are present that collapse to inputs by construction, and references to prior ToM literature are not load-bearing for the new task results. The analysis is therefore self-contained as a standard empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new physical or mathematical axioms. It relies on the standard assumption that human performance on the task is a meaningful proxy for Theory of Mind and that the chosen metrics (intent accuracy, action optimality, planning optimality) adequately capture collaborative success. No free parameters are fitted; the work is purely empirical.

axioms (1)

Domain assumption: Human participants given the same information as the CP variant provide a valid baseline for Theory of Mind performance.
This premise is required to interpret the comparability result; it is stated implicitly when the authors compare Tomcat to the 52 participants.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants on intent accuracy, action optimality, and planning optimality.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 1 internal anchor

[1] M. Tomasello, M. Carpenter, J. Call, T. Behne, H. Moll, Understanding and sharing intentions: The origins of cultural cognition, Behavioral and Brain Sciences 28 (2005) 675–691
[2] A. Duranti, The Anthropology of Intentions, Cambridge University Press, 2015
[3] C. Frith, U. Frith, Theory of Mind, Current Biology 15 (2005) R644–R645
[4] E. Erdogan, F. Dignum, R. Verbrugge, P. Yolum, ToMA: Computational theory of mind with abstractions for hybrid intelligence, Journal of Artificial Intelligence Research 82 (2025) 285–311. doi:10.1613/JAIR.1.16402
[5] J. Richards, M. Wessel, What you need is what you get: Theory of mind for an LLM-based code understanding assistant, in: 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2024, pp. 666–671. doi:10.1109/ICSME58944.2024.00070
[6] J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, M. S. A. Graziano, C. Becchio, Testing Theory of Mind in large language models and humans, Nature Human Behaviour 8 (2024) 1285–1295. doi:10.1038/s41562-024-01882-z
[7] M. Kosinski, Evaluating large language models in Theory of Mind tasks, Proceedings of the National Academy of Sciences 121 (2024) e2405460121. doi:10.1073/pnas.2405460121
[8] K. Gandhi, J.-P. Fraenken, T. Gerstenberg, N. Goodman, Understanding social reasoning in language models with language models, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, volume 36 of NIPS ’23, Curran Associates, Inc., Red Hook, NY, USA, 2023, pp. 13518–13529
[9] Y. Wu, Y. He, Y. Jia, R. Mihalcea, Y. Chen, N. Deng, Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 10691–10706. URL: https...
[10] M. Sbisà, Speech acts in context, Language & Communication 22 (2002) 421–436
[11] T. Zhi-Xuan, L. Ying, V. Mansinghka, J. B. Tenenbaum, Pragmatic instruction following and goal assistance via cooperative language-guided inverse planning, in: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), International Foundation for Autonomous Agents and Multiagent Systems, Auckland, New Zea... doi:10.5555/3635637.3663074
[12] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei... (arXiv, 2020)
[13] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates Inc., Red Hook, NY, USA, 2022, p. 1800. doi:10.5555/3600270.3602070
[14] J. L. Austin, How to Do Things with Words, Clarendon Press, Oxford, 1962
[15] J. R. Searle, Indirect speech acts, in: P. Cole, J. L. Morgan (Eds.), Syntax and Semantics, Volume 3: Speech Acts, Academic Press, New York, 1975, pp. 59–82. doi:10.1163/9789004368811_004, reprinted in [33]
[16] H. H. Clark, Responding to indirect speech acts, Cognitive Psychology 11 (1979) 430–477
[17] M. Sbisà, Varieties of speech act norms, in: M. Witek, I. Witczak-Plisiecka (Eds.), Normativity and Variety of Speech Actions, volume 112 of Poznań Studies in the Philosophy of the Sciences and the Humanities, Brill Rodopi, Leiden, Netherlands, 2018, pp. 23–50. doi:10.1163/9789004366527_003
[18] M. Sbisà, Speech Acts and Other Topics in Pragmatics, Oxford University Press, Oxford, GB, 2023
[19] C. R. Perrault, J. F. Allen, P. R. Cohen, Speech acts as a basis for understanding dialogue coherence, American Journal of Computational Linguistics (1978) 32–39. URL: https://aclanthology.org/J78-3024, microfiche 79
[20] P. R. Cohen, C. R. Perrault, Elements of a plan-based theory of speech acts, Cognitive Science 3 (1979) 117–212
[21] W. Street, LLM Theory of Mind and alignment: Opportunities and risks, in: Proceedings of the Workshop on Theory of Mind in Human-AI Interaction (ToMinHAI@CHI2024), ACM, Honolulu, HI, USA, 2024, pp. 1–7. Workshop paper; arXiv:2405.08154
[22] W. Street, J. O. Siy, G. Keeling, A. Baranes, B. Barnett, M. McKibben, T. Kanyere, A. Lentz, R. I. M. Dunbar, LLMs achieve adult human performance on Higher-Order Theory of Mind tasks, CoRR abs/2405.18870 (2024). URL: https://arxiv.org/abs/2405.18870
[23] M. Jamali, Z. M. Williams, J. Cai, Unveiling Theory of Mind in large language models: A parallel to single neurons in the human brain, CoRR abs/2309.01660 (2023). URL: https://arxiv.org/abs/2309.01660
[24] H. Li, Y. Chong, S. Stepputtis, J. Campbell, D. Hughes, C. Lewis, K. Sycara, Theory of Mind for multi-agent collaboration via large language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 180–192. URL: https://aclanthology.org/2023.emn... doi:10.18653/v1/2023.emnlp-main.13
[25] S. Zhang, X. Wang, W. Zhang, Y. Chen, L. Gao, D. Wang, W. Zhang, X. Wang, Y. Wen, Mutual Theory of Mind in human-AI collaboration: An empirical study with LLM-driven AI agents in a real-time shared workspace task, CoRR abs/2409.08811 (2024). URL: https://arxiv.org/abs/2409.08811
[26] M. Amirizaniani, E. Martin, M. Sivachenko, A. Mashhadi, C. Shah, Can LLMs reason like humans? Assessing Theory of Mind reasoning in LLMs for Open-Ended Questions, in: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024, pp. 34–44
[27] M. Sap, R. L. Bras, D. Fried, Y. Choi, Neural Theory-of-Mind? On the limits of social intelligence in large LMs, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 3762–3780. URL: https://aclanthology.org/2022.emnlp-main.248... doi:10.18653/v1/2022.emnlp-main.248
[28] M. Verma, S. Bhambri, S. Kambhampati, Theory of Mind abilities of large language models in human-robot interaction: An illusion?, in: Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 2024, pp. 36–45
[29] T. Ullman, Large language models fail on trivial alterations to theory-of-mind tasks, CoRR abs/2302.08399 (2023). URL: https://arxiv.org/abs/2302.08399
[30] J. W. A. Strachan, O. Pansardi, E. Scaliti, M. Celotto, K. Saxena, C. Yi, F. Manzi, A. Rufo, G. Manzi, M. S. A. Graziano, S. Panzeri, C. Becchio, GPT-4o reads the mind in the eyes, CoRR abs/2410.22309 (2024). URL: https://arxiv.org/abs/2410.22309
[31] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, ReAct: Synergizing reasoning and acting in language models, in: Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023, pp. 1–33. URL: https://arxiv.org/abs/2210.03629
[32] E. V. Clark, Common Ground, Wiley Online Library, The Handbook of Language Emergence, 2015, pp. 328–353. doi:10.1002/9781118346136.ch15
[33] A. P. Martinich (Ed.), The Philosophy of Language, Oxford University Press, New York, 1985

work page
[70]

Collect: red key at (0,5)

work page
[71]

Collect: blue key at (2,7)

work page
[72]

Unlock: Blue door at (2,1) and Red door at (4,1)

work page
[73]

’’’ Figure A.26: Demonstration exemplar component for Tomcat with Fs-CoT part-7 51

Retrieve: human retrieves gem at (1,1). ’’’ Figure A.26: Demonstration exemplar component for Tomcat with Fs-CoT part-7 51

work page

[1] M. Tomasello, M. Carpenter, J. Call, T. Behne, H. Moll, Understanding and sharing intentions: The origins of cultural cognition, Behavioral and Brain Sciences 28 (2005) 675–691

[2] A. Duranti, The Anthropology of Intentions, Cambridge University Press, 2015

[3] C. Frith, U. Frith, Theory of Mind, Current Biology 15 (2005) R644–R645

[4] E. Erdogan, F. Dignum, R. Verbrugge, P. Yolum, ToMA: Computational theory of mind with abstractions for hybrid intelligence, Journal of Artificial Intelligence Research 82 (2025) 285–311. doi:10.1613/JAIR.1.16402

[5] J. Richards, M. Wessel, What you need is what you get: Theory of mind for an LLM-based code understanding assistant, in: 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2024, pp. 666–671. doi:10.1109/ICSME58944.2024.00070

[6] J. W. A. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, M. S. A. Graziano, C. Becchio, Testing Theory of Mind in large language models and humans, Nature Human Behaviour 8 (2024) 1285–1295. doi:10.1038/s41562-024-01882-z

[7] M. Kosinski, Evaluating large language models in Theory of Mind tasks, Proceedings of the National Academy of Sciences 121 (2024) e2405460121. doi:10.1073/pnas.2405460121

[8] K. Gandhi, J.-P. Fraenken, T. Gerstenberg, N. Goodman, Understanding social reasoning in language models with language models, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, volume 36 of NIPS ’23, Curran Associates, Inc., Red Hook, NY, USA, 2023, pp. 13518–13529

[9] Y. Wu, Y. He, Y. Jia, R. Mihalcea, Y. Chen, N. Deng, Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 10691–10706. URL: https...

[10] M. Sbisà, Speech acts in context, Language & Communication 22 (2002) 421–436

[11] T. Zhi-Xuan, L. Ying, V. Mansinghka, J. B. Tenenbaum, Pragmatic instruction following and goal assistance via cooperative language-guided inverse planning, in: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), International Foundation for Autonomous Agents and Multiagent Systems, Auckland, New Zea... doi:10.5555/3635637.3663074

[12] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei...

[13] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates Inc., Red Hook, NY, USA, 2022, p. 1800. doi:10.5555/3600270.3602070

[14] J. L. Austin, How to Do Things with Words, Clarendon Press, Oxford, 1962

[15] J. R. Searle, Indirect speech acts, in: P. Cole, J. L. Morgan (Eds.), Syntax and Semantics, Volume 3: Speech Acts, Academic Press, New York, 1975, pp. 59–82. doi:10.1163/9789004368811_004, reprinted in [33]

[16] H. H. Clark, Responding to indirect speech acts, Cognitive Psychology 11 (1979) 430–477

[17] M. Sbisà, Varieties of speech act norms, in: M. Witek, I. Witczak-Plisiecka (Eds.), Normativity and Variety of Speech Actions, volume 112 of Poznań Studies in the Philosophy of the Sciences and the Humanities, Brill Rodopi, Leiden, Netherlands, 2018, pp. 23–50. doi:10.1163/9789004366527_003

[18] M. Sbisà, Speech Acts and Other Topics in Pragmatics, Oxford University Press, Oxford, GB, 2023

[19] C. R. Perrault, J. F. Allen, P. R. Cohen, Speech acts as a basis for understanding dialogue coherence, American Journal of Computational Linguistics (1978) 32–39. URL: https://aclanthology.org/J78-3024, microfiche 79

[20] P. R. Cohen, C. R. Perrault, Elements of a plan-based theory of speech acts, Cognitive Science 3 (1979) 117–212

[21] W. Street, LLM Theory of Mind and alignment: Opportunities and risks, in: Proceedings of the Workshop on Theory of Mind in Human-AI Interaction (ToMinHAI@CHI2024), ACM, Honolulu, HI, USA, 2024, pp. 1–7. Workshop paper; arXiv:2405.08154

[22] W. Street, J. O. Siy, G. Keeling, A. Baranes, B. Barnett, M. McKibben, T. Kanyere, A. Lentz, R. I. M. Dunbar, LLMs achieve adult human performance on Higher-Order Theory of Mind tasks, CoRR abs/2405.18870 (2024). URL: https://arxiv.org/abs/2405.18870

[23] M. Jamali, Z. M. Williams, J. Cai, Unveiling Theory of Mind in large language models: A parallel to single neurons in the human brain, CoRR abs/2309.01660 (2023). URL: https://arxiv.org/abs/2309.01660

[24] H. Li, Y. Chong, S. Stepputtis, J. Campbell, D. Hughes, C. Lewis, K. Sycara, Theory of Mind for multi-agent collaboration via large language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 180–192. URL: https://aclanthology.org/2023.emn... doi:10.18653/v1/2023.emnlp-main.13

[25] S. Zhang, X. Wang, W. Zhang, Y. Chen, L. Gao, D. Wang, W. Zhang, X. Wang, Y. Wen, Mutual Theory of Mind in human-AI collaboration: An empirical study with LLM-driven AI agents in a real-time shared workspace task, CoRR abs/2409.08811 (2024). URL: https://arxiv.org/abs/2409.08811

[26] M. Amirizaniani, E. Martin, M. Sivachenko, A. Mashhadi, C. Shah, Can LLMs reason like humans? Assessing Theory of Mind reasoning in LLMs for Open-Ended Questions, in: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024, pp. 34–44

[27] M. Sap, R. L. Bras, D. Fried, Y. Choi, Neural Theory-of-Mind? On the limits of social intelligence in large LMs, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 3762–3780. URL: https://aclanthology.org/2022.emnlp-main.248... doi:10.18653/v1/2022.emnlp-main.248

[28] M. Verma, S. Bhambri, S. Kambhampati, Theory of Mind abilities of large language models in human-robot interaction: An illusion?, in: Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 2024, pp. 36–45

[29] T. Ullman, Large language models fail on trivial alterations to theory-of-mind tasks, CoRR abs/2302.08399 (2023). URL: https://arxiv.org/abs/2302.08399

[30] J. W. A. Strachan, O. Pansardi, E. Scaliti, M. Celotto, K. Saxena, C. Yi, F. Manzi, A. Rufo, G. Manzi, M. S. A. Graziano, S. Panzeri, C. Becchio, GPT-4o reads the mind in the eyes, CoRR abs/2410.22309 (2024). URL: https://arxiv.org/abs/2410.22309

[31] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, ReAct: Synergizing reasoning and acting in language models, in: Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 2023, pp. 1–33. URL: https://arxiv.org/abs/2210.03629

[32] E. V. Clark, Common Ground, Wiley Online Library, The Handbook of Language Emergence, 2015, pp. 328–353. doi:10.1002/9781118346136.ch15

[33] A. P. Martinich (Ed.), The Philosophy of Language, Oxford University Press, New York, 1985

Appendix A. Prompts

Appendix A.1. Common Ground Component

General Chain-of-Thought and Background

You assist a human in a cooperative planning domain called Doors, Keys, and Gems, set in a gridworld. The human attempts to retrieve a specific gem, and you a...

Collect: red key at (0,0)

Collect: yellow key at (1,0)

Pass: red key and yellow key to the human at (3,2)

Unlock: human unlocks the Red door at (3,1) and the Yellow door at (6,0)

Retrieve: human retrieves gem at (7,0).

Figure A.15: Demonstration exemplar component for Tomcat with CP part-4

Demonstration Exemplars: CP Problem-2 (Initial Grid) - Problem (2) demonstrating initial, observed, completed grids with figures and the example. Figure 5: Initial Grid Configuration [[‘W’ ‘W’ ‘b’ ‘W’ ‘W’ ‘W’ ‘r’ ‘W’ ‘W’] [‘W’ ‘r’ ‘.’ ‘...

Collect: red key at (1,1)

Collect: yellow key at (1,3)

Pass: red keys to the human’s future position either at (5,4) or (7,4)

Unlock: human unlocks the Red doors at (8,4) or (9,4)

Retrieve: human retrieves gem at either (10,0) or (10,8). ’’’

Figure A.19: Demonstration exemplar component for Tomcat with CP part-7

Appendix A.4. Demonstration: Fs-CoT

Demonstration Exemplars: Fs-CoT Exemplar-1 (k=7)

Use the following problems and examples, delimited by triple quotes, to understand how to generate the appropriate type, respon...

Collect: blue key at (3,8)

Collect: red key at (5,8)

Pass: blue key and red key to the human at (8,4)

Unlock: human unlocks the Blue door at (8,6) and the Red door at (8,8)

Retrieve: human retrieves gem at (8,9).

Figure A.20: Demonstration exemplar component for Tomcat with Fs-CoT part-1

Demonstration Exemplars: Fs-CoT Exemplar-2 (k=7)

Human Action: The human moves left from their current position at (0,5) to (0,3), where they provide the instruction before continuing their movement.

Instruction: On my way to pick u...

Collect: yellow key at (8,10)

Unlock: human unlocks Blue door at (4,7)

Pass: yellow key to the human at (4,10) or their future position

Unlock: human unlocks the Yellow door at (3,10)

Retrieve: human retrieves gem at (2,10).

Figure A.21: Demonstration exemplar component for Tomcat with Fs-CoT part-2

Demonstration Exemplars: Fs-CoT Exemplar-3 (k=7)

Human Action: The human moves right from their current position at (3,3) to (3,5), where they provide the instruction before continuing their movement.

Instruction: I’ll get the blue...

Collect: red key at (8,6)

Pass: red key to the human at (8,10) or their future position

Unlock: human unlocks Red door at (9,9) and Blue door at (10,10)

Retrieve: human retrieve a gem at either (11,8) or (11,12).

Figure A.22: Demonstration exemplar component for Tomcat with Fs-CoT part-3

Demonstration Exemplars: Fs-CoT Exemplar-4 (k=7)

Human Action: The human moves right from their current position at (5,3) to (5,6) and then moves downward to (6,6), where they provide the instruction before conti...

Collect: red key at (0,7)

Pass: red key to the human at (5,3) or their future position

Unlock: human unlocks Red door at (6,3) and Blue door at (7,3)

Retrieve: human retrieve a gem at either (8,0) or (8,4).

Figure A.23: Demonstration exemplar component for Tomcat with Fs-CoT part-4

Demonstration Exemplars: Fs-CoT Exemplar-5 (k=7)

Human Action: The human moves upward from their current position at (3,5) to (0,5), where they provide the instruction before continuing their movement.

Instruction: ...

Collect: yellow key at (9,0)

Pass: yellow key to the human at (3,5) or (4,2) or their future position

Unlock: human unlocks the Yellow door at (5,2) or (4,5) and the Red door at (6,2) or (6,4), (6,6)

Retrieve: human retrieves the gem at either (7,2), (7,4) or (7,6).

Figure A.24: Demonstration exemplar component for Tomcat with Fs-CoT part-5

Demonstration Exemplars: Fs-CoT Exemplar-6 (k=7)

Human Action: The human moves down from their current position at (4,4) to (5,4), and then proceeds to move left to collect the red key at (5,0). After coll...

Collect: blue key at (0,8)

Unlock: Blue door at (2,4)

Unlock: human unlocks Red door at (2,0)

Retrieve: human retrieves gem at (0,0).

Figure A.25: Demonstration exemplar component for Tomcat with Fs-CoT part-6

Demonstration Exemplars: Fs-CoT Exemplar-7 (k=7)

Human Action: The human moves left from their current position at (6,2) to (6,1) and then upward to (5,1). Upon reaching (5,1), adjacent to the red door at (4,1), they provide the in...

Collect: red key at (0,5)

Collect: blue key at (2,7)

Unlock: Blue door at (2,1) and Red door at (4,1)

Retrieve: human retrieves gem at (1,1). ’’’

Figure A.26: Demonstration exemplar component for Tomcat with Fs-CoT part-7