Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration
Pith reviewed 2026-05-19 07:20 UTC · model grok-4.3
The pith
The Tomcat LLM agent interprets ambiguous instructions at human-comparable levels in goal-oriented collaboration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tomcat, an LLM-based agent designed to exhibit Theory of Mind reasoning, achieves performance comparable to human participants on intent accuracy, action optimality, and planning optimality when using few-shot chain-of-thought prompting with GPT-4o and DeepSeek-R1 in the Instruction Inference task.
What carries the argument
Tomcat, an LLM-based agent that applies either few-shot chain-of-thought examples or commonsense knowledge to infer a principal's mental states and intentions from incomplete instructions.
If this is right
- Agents can collaborate effectively with humans even when instructions are incomplete or ambiguous.
- LLMs configured with few-shot chain-of-thought can exhibit Theory of Mind capabilities in goal-oriented collaborative environments.
- The few-shot variant comes closer than the commonsense-prompt variant to human-comparable levels across the three metrics.
- Such agents reduce the need for perfectly explicit instructions in human-agent teaming.
Where Pith is reading between the lines
- This capability could transfer to physical robots or virtual assistants operating in real-time shared workspaces.
- Extending the task to longer sequences or multi-turn interactions would test whether the performance holds under greater uncertainty.
- Widespread adoption might lower communication costs in teams by allowing agents to fill in reasonable gaps from context.
Load-bearing premise
The 52 human participants who received the same information as the commonsense-prompt variant provide a valid and representative baseline for human Theory of Mind performance in this task.
What would settle it
A replication experiment in which Tomcat with few-shot chain-of-thought on GPT-4o scores substantially below the human baseline on intent accuracy would falsify the comparability result.
Original abstract
Successful human-agent teaming relies on an agent being able to understand instructions given by a (human) principal. In many cases, an instruction may be incomplete or ambiguous. In such cases, the agent must infer the unspoken intentions from their shared context, that is, it must exercise the principal's Theory of Mind (ToM) and infer the mental states of its principal. We consider the prospects of effective human-agent collaboration using large language models (LLMs). To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting incomplete or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implemented two variants of Tomcat. One, dubbed Fs-CoT (Fs for few-shot, CoT for chain-of-thought), is based on a small number of examples demonstrating the requisite structured reasoning. One, dubbed CP (commonsense prompt), relies on commonsense knowledge and information about the problem. We realized both variants of Tomcat on three leading LLMs, namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-agent collaboration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Instruction Inference task to evaluate Theory of Mind (ToM) capabilities in dynamic, goal-oriented human-agent collaboration. It presents Tomcat, an LLM-based agent with two variants, Fs-CoT (few-shot examples plus structured chain-of-thought) and CP (commonsense prompt only), both implemented on GPT-4o, DeepSeek-R1, and Gemma-3-27B. Performance is measured via intent accuracy, action optimality, and planning optimality, and is compared against a human study with 52 participants who received the same information as the CP variant. The central finding is that Tomcat Fs-CoT, particularly with GPT-4o and DeepSeek-R1, reaches performance levels comparable to the human participants.
Significance. If the empirical comparison holds under controlled conditions, the work offers a concrete, dynamic benchmark for ToM in collaborative settings and provides initial evidence that advanced LLMs can perform instruction inference at human-comparable levels. The task design and dual-variant agent architecture are useful contributions for future human-agent teaming research.
major comments (2)
- [Abstract and Human Study] The central comparability claim (Fs-CoT Tomcat matching human performance on the three metrics) rests on the human baseline, yet participants received only the CP information while Fs-CoT supplies additional few-shot demonstrations and explicit chain-of-thought scaffolding. This asymmetry is load-bearing for interpreting whether comparable scores reflect ToM inference or the extra prompting structure; the manuscript does not address or control for it.
- [Evaluation and Results] No error bars, statistical significance tests, variance measures, or details on participant selection, task timing, scoring rubrics for action/plan optimality, or prior exposure controls are reported for the 52-participant study. These omissions directly affect confidence in the cross-condition and cross-metric comparability statements.
minor comments (1)
- [Metrics] Notation for the three optimality metrics could be defined more explicitly (e.g., formal definitions or pseudocode) to aid reproducibility.
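To illustrate the level of explicitness this comment asks for, the following is a minimal Python sketch of how three per-trial metrics could be written down. Everything in it is an assumption for illustration: the `Trial` fields, the exact-match criteria, and the function names are hypothetical and are not the paper's definitions, which the manuscript would need to supply.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Trial:
    """One evaluated response (hypothetical schema, not from the paper)."""
    inferred_intent: str               # intent the agent or participant reported
    true_intent: str                   # ground-truth intent of the principal
    actions: Tuple[str, ...]           # actions the respondent chose
    optimal_actions: Tuple[str, ...]   # reference optimal actions
    plan_length: int                   # length of the respondent's plan
    optimal_plan_length: int           # length of a shortest valid plan


def _fraction(flags: Sequence[bool]) -> float:
    """Share of trials that satisfy a criterion."""
    if not flags:
        raise ValueError("at least one trial is required")
    return sum(flags) / len(flags)


def intent_accuracy(trials: Sequence[Trial]) -> float:
    # Assumed criterion: the inferred intent matches the ground truth exactly.
    return _fraction([t.inferred_intent == t.true_intent for t in trials])


def action_optimality(trials: Sequence[Trial]) -> float:
    # Assumed criterion: the chosen actions equal the reference actions.
    return _fraction([t.actions == t.optimal_actions for t in trials])


def planning_optimality(trials: Sequence[Trial]) -> float:
    # Assumed criterion: the plan is as short as a shortest valid plan.
    return _fraction([t.plan_length == t.optimal_plan_length for t in trials])
```

Writing each metric as a share of trials that meet a stated criterion keeps the three scores on the same 0 to 1 scale; a graded variant (for example, a length ratio for planning) would need its own explicit definition.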
Simulated Authors' Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate planned revisions to improve the clarity and rigor of the manuscript.
- Referee: [Abstract and Human Study] The central comparability claim (Fs-CoT Tomcat matching human performance on the three metrics) rests on the human baseline, yet participants received only the CP information while Fs-CoT supplies additional few-shot demonstrations and explicit chain-of-thought scaffolding. This asymmetry is load-bearing for interpreting whether comparable scores reflect ToM inference or the extra prompting structure; the manuscript does not address or control for it.
Authors: We agree that the manuscript should explicitly discuss the information asymmetry between the human participants (who received only CP-equivalent information) and the Fs-CoT agent variant. The human study was designed to provide a direct baseline for commonsense-based instruction inference without additional scaffolding, allowing comparison to the minimal CP agent. The Fs-CoT results are presented to illustrate what LLMs can achieve under structured prompting. We will revise the abstract, introduction, and results sections to clarify this design choice, discuss its implications for interpreting ToM performance, and note that future work could include human conditions with few-shot examples to further control for the asymmetry.
revision: partial
- Referee: [Evaluation and Results] No error bars, statistical significance tests, variance measures, or details on participant selection, task timing, scoring rubrics for action/plan optimality, or prior exposure controls are reported for the 52-participant study. These omissions directly affect confidence in the cross-condition and cross-metric comparability statements.
Authors: We acknowledge that the current manuscript lacks these statistical and methodological details. We will add error bars and variance measures to all reported metrics, include appropriate statistical significance tests for cross-condition comparisons, and expand the methods section with full details on participant selection criteria, task timing, scoring rubrics for action and planning optimality, and controls for prior exposure or familiarity with similar tasks.
revision: yes
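One common way to supply the variance measures discussed here is a percentile bootstrap on the difference in mean scores between an agent condition and the human baseline. The sketch below is a generic illustration, not the authors' planned analysis; the function name, its parameters, and any scores passed to it are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_diff_ci(
    agent_scores: Sequence[float],
    human_scores: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(agent) - mean(human)."""
    if not agent_scores or not human_scores:
        raise ValueError("both score lists must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        a = rng.choices(agent_scores, k=len(agent_scores))
        h = rng.choices(human_scores, k=len(human_scores))
        diffs.append(statistics.fmean(a) - statistics.fmean(h))
    diffs.sort()
    # Read the lower and upper percentiles off the sorted differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper
```

An interval that excludes zero suggests the two conditions differ at roughly the chosen level. An interval that straddles zero is compatible with comparable performance but does not by itself establish equivalence, which would call for an equivalence test with a pre-specified margin.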
Circularity Check
No circularity in empirical comparison to human baseline
The paper introduces the Instruction Inference task and evaluates Tomcat (Fs-CoT and CP variants) on three LLMs against 52 human participants using direct metrics of intent accuracy, action optimality, and planning optimality. The central claim of comparable performance rests on experimental measurement of outputs against independently collected human data rather than any derivation, fitted parameter, or self-referential reduction. No equations or predictions are present that collapse to inputs by construction, and references to prior ToM literature are not load-bearing for the new task results. The analysis is therefore self-contained as a standard empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human participants given the same information as the CP variant provide a valid baseline for Theory of Mind performance.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants on intent accuracy, action optimality, and planning optimality."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.