pith. sign in

arxiv: 2605.15205 · v1 · pith:X6SQH5VVnew · submitted 2026-04-28 · 💻 cs.AI

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

Pith reviewed 2026-05-19 17:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords Theory of MindHuman-AI InteractionLarge Language ModelsInteractive EvaluationBenchmark LimitationsSocial AIUser StudiesDynamic Interactions
0
0 comments X

The pith

Theory of Mind gains on static benchmarks often fail to improve performance in live human-AI interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether techniques that boost Theory of Mind in large language models, as measured by story-based multiple-choice tests, actually produce better results when models interact directly with humans. It shifts evaluation to a first-person, ongoing dialogue setting across goal-driven tasks like coding and math as well as open-ended ones like counseling. Experiments with four representative enhancement methods, using real datasets and a user study, show that benchmark success does not reliably carry over to dynamic exchanges. The work argues that interaction-based assessments are necessary to develop LLMs that can engage effectively in human-AI symbiosis.

Core claim

The authors establish that ToM enhancement techniques which succeed on static, multiple-choice benchmarks frequently fail to deliver measurable benefits when LLMs engage in live, open-ended interactions with humans, as demonstrated across coding, math, and counseling scenarios using both automated metrics and user feedback.

What carries the argument

Interactive ToM evaluation paradigm with perspective shift from third-person stories to first-person dialogue and metric shift from accuracy on questions to task success and user experience in ongoing exchanges.

If this is right

  • Evaluation protocols for socially capable LLMs must incorporate live interaction rather than relying solely on static benchmarks.
  • ToM methods may require additional adaptation to handle real-time perspective taking and response adjustment during conversation.
  • Both goal-oriented and experience-oriented tasks need coverage to reveal where benchmark improvements break down.
  • Over-reliance on current ToM metrics risks developing models that appear socially aware only in controlled test settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap suggests that direct training on interactive data or explicit modeling of conversation history may be needed alongside ToM modules.
  • Other capabilities such as instruction following or emotional tone adjustment could prove more decisive for interaction quality than isolated ToM gains.
  • Hybrid evaluation suites combining static probes with live sessions could provide a more reliable signal for iterative model development.

Load-bearing premise

The chosen interactive tasks and user-study protocol sufficiently represent the first-person, dynamic, open-ended nature of typical human-AI interactions.

What would settle it

A follow-up study that finds one or more ToM improvement techniques produce statistically significant gains in both task completion rates and user satisfaction scores across repeated open-ended dialogues with multiple models.

Figures

Figures reproduced from arXiv: 2605.15205 by Haotian Li, Huamin Qu, Jianxun Lian, Nanxu Gong, Xing Xie, Yanjie Fu, Zishu Zhao, Zixin Chen.

Figure 1
Figure 1. Figure 1: Based on a new dynamic and interactive evaluation paradigm, our research explores the effectiveness of [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our interactive ToM evaluation paradigm for real-world HAI interaction. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Model variants’ performance on CollabLLM. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Case studies from a goal-oriented task (left) and a experience-oriented task (right). [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Cases of positive (left) and negative (right) experiences and corresponding comments in our user study. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt of FaR. PT You are a helpful assistant. When a user asks you a question, your internal thought process to generate the most helpful response should be as follows: Step 1: Perspective-Taking (Understanding the User's Context and Needs) Before you begin to formulate your answer, first analyze the user's question (`{user_question}`) and attempt to understand the following aspects of their perspective: … view at source ↗
Figure 7
Figure 7. Figure 7: Prompt of PT. SFT "instruction": "Answer the question based on the story.", "input": "Story: The sun shone through the large glass doors of the hotel lobby, illuminating the marble floor and casting a warm glow over the comfortable seating areas. Soft music filled the air, mingling with the gentle hum of conversation and the occasional chime of the elevators in the bustling hotel. As I entered the hotel lo… view at source ↗
Figure 8
Figure 8. Figure 8: Data example of SFT. RL prompt": [ { "content": "<|im_start|>system\nYou are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Now the us… view at source ↗
Figure 9
Figure 9. Figure 9: Data example of RL [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Word cloud of participants’ filtered com [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Main experiment interface. (A) User initialization and task instructions. (B) Conversation window with [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: User ranking interface. (A) Model responses presented for comparison. (B) User ranking panel where [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Feedback interface shown after each round of response ranking. [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: The case on GPT-4o variants [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: The case on GPT-4o variants [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: The case on Llama-3.1-8B variants [PITH_FULL_IMAGE:figures/full_fig_p025_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: The case on Llama-3.1-8B variants [PITH_FULL_IMAGE:figures/full_fig_p026_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: The user case on GPT-4o variants [PITH_FULL_IMAGE:figures/full_fig_p028_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: The user case on Llama-3.1-8B variants [PITH_FULL_IMAGE:figures/full_fig_p029_19.png] view at source ↗
read the original abstract

Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, the existing benchmarks often measure ToM capability improvement through story-reading, multiple-choice questions from a third-person perspective, while ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions. To directly examine how ToM improvement techniques benefit HAI interactions, we first proposed the new paradigm of interactive ToM evaluation with both perspective and metric shifts. Next, following the paradigm, we conducted a systematic study of four representative ToM enhancement techniques using both four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling). Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing the necessity of interaction-based assessments in developing next-generation, socially aware LLMs for HAI symbiosis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that improvements to Theory of Mind (ToM) capabilities in LLMs, as measured on static third-person benchmarks, do not reliably translate into better performance during dynamic, first-person human-AI interactions. The authors introduce an interactive ToM evaluation paradigm that shifts both perspective and metrics, then apply it to four representative ToM enhancement techniques across four real-world datasets plus a user study; the tasks span goal-oriented domains (coding, math) and experience-oriented domains (counseling). The central empirical finding is that static-benchmark gains frequently fail to produce corresponding gains under interactive conditions.

Significance. If the central observation holds, the work supplies concrete evidence that current ToM enhancement pipelines may be insufficient for socially situated HAI use cases and underscores the value of interaction-based assessment. The empirical scope—four datasets plus a controlled user study—is a clear strength that grounds the claim in both quantitative and qualitative data.

major comments (1)
  1. Evaluation Setup / Paradigm Shift section: the claim that the chosen tasks and protocol capture the 'first-person, dynamic, and open-ended nature of human-AI interactions' is load-bearing for the generalization that static ToM gains 'do not always translate.' The manuscript must specify whether interactions are subject to turn limits, predefined success criteria, or other constraints that could artifactually limit transfer; without this detail the observed dissociation could be an artifact of task structure rather than a general property of ToM improvement.
minor comments (2)
  1. Abstract: the phrase 'four real-world datasets' is used without naming them or indicating their domains; a parenthetical list would improve immediate readability.
  2. User-study description: participant selection criteria, exact interaction metrics, and statistical controls (e.g., multiple-comparison correction, power analysis) receive only high-level mention; adding these details would strengthen reproducibility without altering the central claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify key aspects of our evaluation paradigm. We address the major comment below and commit to revisions that strengthen the manuscript without altering its core findings.

read point-by-point responses
  1. Referee: Evaluation Setup / Paradigm Shift section: the claim that the chosen tasks and protocol capture the 'first-person, dynamic, and open-ended nature of human-AI interactions' is load-bearing for the generalization that static ToM gains 'do not always translate.' The manuscript must specify whether interactions are subject to turn limits, predefined success criteria, or other constraints that could artifactually limit transfer; without this detail the observed dissociation could be an artifact of task structure rather than a general property of ToM improvement.

    Authors: We agree that explicit details on interaction constraints are necessary to support the generalization. In the revised manuscript, we will expand the Evaluation Setup section with a new subsection titled 'Interaction Protocol and Constraints.' This will specify that all interactions were conducted without artificial turn limits, allowing natural, open-ended dialogue until participants indicated task completion or a natural stopping point (typically 5-15 turns depending on task complexity). Predefined success criteria were task-specific and drawn directly from the real-world datasets: for goal-oriented tasks (coding and math), success was determined by objective metrics such as code passing unit tests or correct mathematical solutions; for experience-oriented tasks (counseling), it relied on post-interaction user surveys measuring perceived helpfulness and emotional support. We will also acknowledge that these criteria, while grounded in dataset objectives, represent one operationalization and discuss how they align with dynamic HAI settings. These additions will directly address the concern that the observed dissociation might be artifactual. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of static vs. interactive ToM evaluation

full rationale

The paper conducts an empirical study using real-world datasets and a user study to compare ToM enhancement techniques on static benchmarks versus dynamic HAI interactions. No derivations, equations, fitted parameters, or self-citation chains are present that reduce any reported finding to its inputs by construction. The central claims rest on experimental results and protocol design rather than any self-referential reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical evaluation study. No mathematical free parameters, new physical entities, or ad-hoc axioms are introduced beyond standard assumptions in user-study design and task selection.

axioms (1)
  • domain assumption The selected goal-oriented and experience-oriented tasks adequately sample the space of human-AI interactions.
    Invoked when generalizing from coding/math and counseling scenarios to broader HAI symbiosis.

pith-pipeline@v0.9.0 · 5740 in / 1158 out tokens · 29788 ms · 2026-05-19T17:52:03.442586+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 1 internal anchor

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    arXiv preprint arXiv:2405.08154 , year=

    Llm theory of mind and alignment: Opportunities and risks , author=. arXiv preprint arXiv:2405.08154 , year=

  9. [9]

    Nature Human Behaviour , volume=

    Testing theory of mind in large language models and humans , author=. Nature Human Behaviour , volume=. 2024 , publisher=

  10. [10]

    Proceedings of the National Academy of Sciences , volume=

    Evaluating large language models in theory of mind tasks , author=. Proceedings of the National Academy of Sciences , volume=. 2024 , publisher=

  11. [11]

    Revisiting the evaluation of theory of mind through question answering , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages=

  12. [12]

    The 2023 Conference on Empirical Methods in Natural Language Processing , year=

    Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models , author=. The 2023 Conference on Empirical Methods in Natural Language Processing , year=

  13. [13]

    Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL) , pages=

    ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind , author=. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL) , pages=

  14. [14]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    FANToM: A benchmark for stress-testing machine theory of mind in interactions , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  15. [15]

    Advances in Neural Information Processing Systems , volume=

    Understanding social reasoning in language models with language models , author=. Advances in Neural Information Processing Systems , volume=

  16. [16]

    The 62nd Annual Meeting of the Association for Computational Linguistics , year=

    OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models , author=. The 62nd Annual Meeting of the Association for Computational Linguistics , year=

  17. [17]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

    NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

  18. [18]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    ToMBench: Benchmarking Theory of Mind in Large Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  19. [19]

    The Thirteenth International Conference on Learning Representations , year =

    Explore Theory of Mind: program-guided adversarial data generation for theory of mind reasoning , author=. The Thirteenth International Conference on Learning Representations , year =

  20. [20]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Tomato: Verbalizing the mental states of role-playing llms for benchmarking theory of mind , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  21. [21]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

    A notion of complexity for theory of mind via discrete world models , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

  22. [22]

    Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

    Metacognitive prompting improves understanding in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  23. [23]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  24. [24]

    Findings of the Association for Computational Linguistics: ACL 2024 , pages=

    TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models’ Theory-of-Mind , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

  25. [25]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Think twice: Perspective-taking improves large language models’ theory-of-mind capabilities , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  26. [26]

    McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, and Manaal Faruqui

    How far are large language models from agents with theory-of-mind? , author=. arXiv preprint arXiv:2310.03051 , year=

  27. [27]

    arXiv preprint arXiv:2504.01698 , year=

    Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models? , author=. arXiv preprint arXiv:2504.01698 , year=

  28. [28]

    Proceedings of the International Joint Conference on Neural Networks , pages =

    Zhanwen Chen and Tianchun Wang and Yizhou Wang and Michal Kosinski and Xiang Zhang and Yun Fu and Sheng Li , title =. Proceedings of the International Joint Conference on Neural Networks , pages =

  29. [29]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Coke: A cognitive knowledge graph for machine theory of mind , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  30. [30]

    arXiv preprint arXiv:2502.11881 , year=

    Hypothesis-driven theory-of-mind reasoning for large language models , author=. arXiv preprint arXiv:2502.11881 , year=

  31. [31]

    ICLR 2025 Workshop on Foundation Models in the Wild , year=

    Autotom: Automated bayesian inverse planning and model discovery for open-ended theory of mind , author=. ICLR 2025 Workshop on Foundation Models in the Wild , year=

  32. [32]

    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    I cast detect thoughts: Learning to converse and guide with intents and theory-of-mind in dungeons and dragons , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  33. [33]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  34. [34]

    Theory of Mind in Large Language Models: Assessment and Enhancement , booktitle =

    Ruirui Chen and Weifeng Jiang and Chengwei Qin and Cheston Tan , editor =. Theory of Mind in Large Language Models: Assessment and Enhancement , booktitle =

  35. [35]

    arXiv preprint arXiv:2502.08796 , year=

    A systematic review on the evaluation of large language models in theory of mind tasks , author=. arXiv preprint arXiv:2502.08796 , year=

  36. [36]

    arXiv preprint arXiv:2502.06470 , year=

    A survey of theory of mind in large language models: Evaluations, representations, and safety risks , author=. arXiv preprint arXiv:2502.06470 , year=

  37. [37]

    Philosophical Transactions B , volume=

    Re-evaluating Theory of Mind evaluation in large language models , author=. Philosophical Transactions B , volume=. 2025 , publisher=

  38. [38]

    Hofman , editor =

    Serina Chang and Ashton Anderson and Jake M. Hofman , editor =. ChatBench: From Static Benchmarks to Human-AI Evaluation , booktitle =

  39. [39]

    CollabLLM: From Passive Responders to Active Collaborators , year=

    Wu, Shirley and Galley, Michel and Peng, Baolin and Cheng, Hao and Li, Gavin and Dou, Yao and Cai, Weixin and Zou, James and Leskovec, Jure and Gao, Jianfeng , booktitle=. CollabLLM: From Passive Responders to Active Collaborators , year=

  40. [40]

    Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

    Mentalchat16k: A benchmark dataset for conversational mental health assistance , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=

  41. [41]

    Towards Emotional Support Dialog Systems , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

  42. [42]

    arXiv preprint arXiv:2508.15013 , year=

    Goals and the Structure of Experience , author=. arXiv preprint arXiv:2508.15013 , year=

  43. [43]

    Advanced personality , pages=

    Cognitive-experiential self-theory , author=. Advanced personality , pages=. 1998 , publisher=

  44. [44]

    , author=

    Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. , author=. American psychologist , volume=. 2000 , publisher=

  45. [45]

    2010 , publisher=

    Experience design: Technology for all the right reasons , author=. 2010 , publisher=

  46. [46]

    , author=

    Theory of mind in middle childhood: Longitudinal associations with executive function and social competence. , author=. Developmental psychology , volume=. 2016 , publisher=

  47. [47]

    , author=

    Theory of mind and prosocial behavior in childhood: A meta-analytic review. , author=. Developmental psychology , volume=. 2016 , publisher=

  48. [48]

    theory of mind

    Does the autistic child have a “theory of mind”? , author=. Cognition , volume=. 1985 , publisher=

  49. [49]

    Prolific · Quickly find research participants you can trust. , url=. www.prolific.com , author=

  50. [50]

    , author=

    Interaction process analysis; a method for the study of small groups. , author=. 1950 , publisher=

  51. [51]

    Science , year =

    Evidence for a collective intelligence factor in the performance of human groups , author =. Science , year =

  52. [52]

    PLOS ONE , year =

    Reading the Mind in the Eyes or on the Web? Theory of Mind predicts collective intelligence equally well online and face-to-face , author =. PLOS ONE , year =

  53. [53]

    Brain , year =

    Two systems for empathy: A double dissociation between emotional and cognitive empathy in inferior frontal gyrus versus ventromedial prefrontal lesions , author =. Brain , year =

  54. [54]

    International Journal of Listening , year =

    The Relative Effectiveness of Active Listening in Initial Interactions , author =. International Journal of Listening , year =

  55. [55]

    arXiv preprint arXiv:2505.24863 , year=

    AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time , author=. arXiv preprint arXiv:2505.24863 , year=

  56. [56]

    arXiv preprint arXiv:2306.03100 , year=

    Rethinking model evaluation as narrowing the socio-technical gap , author=. arXiv preprint arXiv:2306.03100 , year=

  57. [57]

    PloS one , volume=

    Reading the mind in the eyes or reading between the lines? Theory of mind predicts collective intelligence equally well online and face-to-face , author=. PloS one , volume=. 2014 , publisher=

  58. [58]

    science , volume=

    Evidence for a collective intelligence factor in the performance of human groups , author=. science , volume=. 2010 , publisher=

  59. [59]

    Handbook of Human-Centered Artificial Intelligence , pages=

    Theory of Mind in Human-AI Interaction and AI , author=. Handbook of Human-Centered Artificial Intelligence , pages=. 2025 , publisher=

  60. [60]

    arXiv preprint arXiv:2501.15355 , year=

    Large language models as theory of mind aware generative agents with counterfactual reflection , author=. arXiv preprint arXiv:2501.15355 , year=

  61. [61]

    Infusing Theory of Mind into Socially Intelligent LLM Agents

    Infusing Theory of Mind into Socially Intelligent LLM Agents , author=. arXiv preprint arXiv:2509.22887 , year=

  62. [62]

    , author=

    A power primer. , author=. 2016 , publisher=

  63. [63]

    Proceedings of the 2016 CHI conference on human factors in computing systems , pages=

    Local standards for sample size at CHI , author=. Proceedings of the 2016 CHI conference on human factors in computing systems , pages=