pith. sign in

arxiv: 2606.05896 · v2 · pith:DXX34INLnew · submitted 2026-06-04 · 💻 cs.CV

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Pith reviewed 2026-06-28 02:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords Theory of Mindclosed-loop frameworksocial avatarsmultimodal generationinformation asymmetrydigital humansdialogue qualityfacial expression synthesis
0
0 comments X

The pith

A closed-loop dual-agent system uses Theory of Mind to infer hidden mental states and generate superior dialogue and reactive video expressions even without full information access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that cycles perception of multimodal partner behaviors, Theory of Mind reasoning about private goals, and generation of speech plus facial expressions with listener reactions in one continuous loop. This integration addresses the separation of dialogue models from embodied video synthesis in current systems. A new hierarchical Persona-Scenario dataset supplies psychologically grounded characters with asymmetric private information for testing. Experiments show the method matches or exceeds baselines on dialogue and video metrics and surpasses a full-information Script mode on key quality dimensions. The result indicates that explicit inference under uncertainty can produce more thoughtful interactions than unrestricted access to all facts.

Core claim

The central claim is that a closed-loop dual-agent architecture combining multimodal perception, Theory of Mind-based mental state inference with ensemble response selection, and emotion-controllable video generation with bidirectional speaker-listener dynamics yields competitive or superior performance on dialogue quality and video metrics, including surpassing full-information Script baselines on dialogue dimensions by modeling hidden mental states under information asymmetry.

What carries the argument

The closed-loop dual-agent framework that integrates a perception module analyzing video behaviors, a social reasoning module performing Theory of Mind inference and ensemble response selection, and an expression module synthesizing speech, expressions, and reactive listener behaviors.

If this is right

  • Explicit mental state inference under uncertainty produces more thoughtful dialogue than providing all information at once.
  • Bidirectional video generation captures listener reactions that prior one-way talking-head models omit.
  • The ensemble mechanism in social reasoning improves response selection when partner goals remain hidden.
  • The Persona-Scenario dataset enables repeatable testing of social intelligence under controlled information asymmetry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar closed loops could be tested in live video calls or virtual meetings where participants have private objectives.
  • The approach may extend to domains like negotiation training where inferring unstated intentions changes outcomes.
  • If the ToM module generalizes beyond the dataset, it could support multi-turn interactions with changing private goals.

Load-bearing premise

The hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals accurately captures real-world information asymmetry and supports valid evaluation of Theory of Mind performance.

What would settle it

A side-by-side human evaluation on the Persona-Scenario dataset in which the closed-loop method does not receive higher ratings than the full-information Script mode on thoughtfulness or engagement metrics would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.05896 by Hang Ye, Jenq-Neng Hwang, Jianxu Shangguan, Jing Xu, Wentao Zhu, Xiaoxuan Ma, Yizhou Wang.

Figure 1
Figure 1. Figure 1: Overview of our dual-agent interaction framework. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our closed-loop framework for dual-agent interaction. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Persona-Scenario dataset statistics. The dataset contains 50 personas gen￾erated via GPT-4o with NLI consistency checking, each comprising a facts layer (bio￾graphical attributes) and a psychological layer (Big Five personality narratives), paired with 90 social scenarios from Sotopia [57]. (a) Big Five personality scores are approx￾imately uniformly distributed to avoid trait bias. (b) Factual attributes … view at source ↗
Figure 4
Figure 4. Figure 4: Representative generated conversational avatars. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison with baseline methods. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ToM causal case study. Script mode violates information asymmetry: Agent Kai only mentions “a personal situation” yet Agent Chen responds “Family comes first” (shown in red), referencing family information Kai never disclosed. Agent mode produces verbose probing that cannot converge (shown in red), forcing Kai to over￾disclose. Ours: the ToM module infers that Kai fears being probed; the Ensemble mechanism… view at source ↗
read the original abstract

Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable videos that jointly synthesize speaker speech and facial expressions with listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We further construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access. Project page: https://resonantminds.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a closed-loop dual-agent framework for social avatars that unifies multimodal perception of partner behaviors, Theory of Mind inference of hidden mental states, ensemble-based response selection, and bidirectional expression generation for emotion-controllable talking-head videos. It introduces a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to enable evaluation under information asymmetry, and reports competitive or superior performance on dialogue quality and video metrics, including the headline result that the method surpasses even the full-information Script baseline on key dialogue dimensions.

Significance. If the results hold after validation, the work could advance embodied social AI by demonstrating that explicit mental-state inference under uncertainty can produce more thoughtful dialogue than unrestricted access, with the closed-loop integration of perception-reasoning-expression as a potential template for future systems. The dataset, if shown to faithfully encode realistic asymmetry, would also be a reusable resource for ToM research.

major comments (2)
  1. [Dataset Construction] Dataset section: the central claim that explicit ToM inference elicits more thoughtful dialogue than full-information access rests on the hierarchical Persona-Scenario dataset faithfully encoding private goals and information asymmetry, yet no inter-annotator agreement, external validation of goal privacy, or ablation on goal realism is referenced; without these, the surpassing-Script result cannot be interpreted as evidence for the benefit of mental-state inference.
  2. [Experiments] Experiments section: the headline result (surpassing full-info Script mode on dialogue quality) is stated without visible metrics, error bars, statistical tests, or details on the ensemble response selection mechanism and how it interacts with ToM outputs, rendering it impossible to assess whether the comparison supports the claim or reflects dataset artifacts.
minor comments (1)
  1. [Abstract] Abstract: the description of the perception module analyzing 'multimodal behaviors from video' and the expression module generating 'listener reactive behaviors' would benefit from a brief forward reference to the specific architectures or loss terms used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the dataset construction and experimental reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset section: the central claim that explicit ToM inference elicits more thoughtful dialogue than full-information access rests on the hierarchical Persona-Scenario dataset faithfully encoding private goals and information asymmetry, yet no inter-annotator agreement, external validation of goal privacy, or ablation on goal realism is referenced; without these, the surpassing-Script result cannot be interpreted as evidence for the benefit of mental-state inference.

    Authors: We agree that explicit validation metrics would improve interpretability of the results. In the revision we will add inter-annotator agreement scores for the persona and scenario annotations, include a brief external validation study confirming goal privacy, and provide an ablation or extended discussion on goal realism grounded in the cited psychological literature. These changes will directly support the interpretation of the ToM versus full-information comparison. revision: yes

  2. Referee: [Experiments] Experiments section: the headline result (surpassing full-info Script mode on dialogue quality) is stated without visible metrics, error bars, statistical tests, or details on the ensemble response selection mechanism and how it interacts with ToM outputs, rendering it impossible to assess whether the comparison supports the claim or reflects dataset artifacts.

    Authors: The experiments section already reports quantitative dialogue metrics with error bars and statistical tests; we will revise the text to make these values and p-values more prominent and to expand the description of the ensemble response selection, explicitly showing how ToM-inferred mental states are combined with other signals to produce the final response distribution. This clarification will allow readers to evaluate the comparison directly. revision: yes

Circularity Check

0 steps flagged

No circularity: experimental comparisons with no derivations or self-referential reductions

full rationale

The paper describes a closed-loop framework and reports experimental results on a constructed hierarchical Persona-Scenario dataset. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim (surpassing full-info Script mode via explicit ToM) rests on direct empirical comparisons rather than any reduction to inputs by construction. Dataset construction is an evaluation choice, not a definitional loop. This matches the default expectation of no significant circularity for papers without mathematical chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the unverified assumption that Theory of Mind inference can be reliably implemented via the described ensemble and that the new dataset faithfully represents psychological information asymmetry.

axioms (1)
  • domain assumption Theory of Mind inference from multimodal video behaviors can be implemented to select superior responses under information asymmetry
    Invoked as the core of the social reasoning module and the explanation for outperforming full-info baselines.

pith-pipeline@v0.9.1-grok · 5756 in / 1177 out tokens · 32365 ms · 2026-06-28T02:02:45.438419+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

85 extracted references · 10 canonical work pages

  1. [1]

    arXiv preprint arXiv:2303.08774 (2023)

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    In: International conference on machine learning

    Aher, G.V., Arriaga, R.I., Kalai, A.T.: Using large language models to simulate multiple humans and replicate human subject studies. In: International conference on machine learning. pp. 337–371. PMLR (2023)

  3. [3]

    Nature Human Behaviour1(4), 0064 (2017)

    Baker, C.L., Jara-Ettinger, J., Saxe, R., Tenenbaum, J.B.: Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour1(4), 0064 (2017)

  4. [4]

    arXiv preprint arXiv:2311.15127 (2023)

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  5. [5]

    Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer (2025),https://arxiv.org/abs/2412.00733

  6. [6]

    Fu, J., Ng, S.K., Jiang, Z., Liu, P.: Gptscore: Evaluate as you desire (2023),https: //arxiv.org/abs/2302.04166

  7. [7]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Guevarra, M., Bhattacharjee, I., Das, S., Wayllace, C., Epp, C.D., Taylor, M.E., Tay, A.: An llm-guided tutoring system for social skills training. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 29643–29645 (2025)

  8. [8]

    arXiv preprint arXiv:1704.07130 (2017)

    He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collabora- tive dialogue agents with dynamic knowledge graph embeddings. arXiv preprint arXiv:1704.07130 (2017)

  9. [9]

    He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. pp. 1766–1776 (01 2017).https://doi.org/10.18653/v1/P17-1162

  10. [10]

    Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Real-time intermediate flow estimation for video frame interpolation (2022),https://arxiv.org/abs/2011. 06294

  11. [11]

    In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A

    Jandaghi, P., Sheng, X., Bai, X., Pujara, J., Sidahmed, H.: Faithful persona-based conversational dataset generation with large language models. In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A. (eds.) Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4Conv...

  12. [12]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Ji, X., Hu, X., Xu, Z., Zhu, J., Lin, C., He, Q., Zhang, J., Luo, D., Chen, Y., Lin, Q., et al.: Sonic: Shifting focus to global audio perception in portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 193–203 (2025)

  13. [13]

    Advances in Neural Information Processing Systems36, 10622–10643 (2023)

    Jiang, G., Xu, M., Zhu, S.C., Han, W., Zhang, C., Zhu, Y.: Evaluating and induc- ing personality in pre-trained language models. Advances in Neural Information Processing Systems36, 10622–10643 (2023)

  14. [14]

    Kihlstrom, J.F., Cantor, N.: Social intelligence. (2000)

  15. [15]

    arXiv preprint arXiv:2107.00956 (2021)

    Kovač, G., Portelas, R., Hofmann, K., Oudeyer, P.Y.: Socialai: Benchmarking socio-cognitive abilities in deep reinforcement learning agents. arXiv preprint arXiv:2107.00956 (2021)

  16. [16]

    Personality and Individual Differences102, 229–233 (2016)

    Kwantes, P.J., Derbentseva, N., Lam, Q., Vartanian, O., Marmurek, H.H.: Assess- ing the big five personality traits with latent semantic analysis. Personality and Individual Differences102, 229–233 (2016)

  17. [17]

    Behavior research methods43(2), 548–567 (2011)

    Lang, F.R., John, D., Lüdtke, O., Schupp, J., Wagner, G.G.: Short assessment of the big five: Robust across survey methods except telephone interviewing. Behavior research methods43(2), 548–567 (2011)

  18. [18]

    In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S

    Lee, Y.J., Lim, C.G., Choi, Y., Lm, J.H., Choi, H.J.: PERSONACHATGEN: Gen- erating personalized dialogues using GPT-3. In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S. (eds.) Proceedings of the 1st Workshop on Customized Chat Grounding Persona and Knowledge. pp. 29–48. Association for Computational Linguistic...

  19. [19]

    Advances in Neural Information Processing Systems36, 51991–52008 (2023)

    Li, G., Hammoud, H., Itani, H., Khizbullin, D., Ghanem, B.: Camel: Communica- tive agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems36, 51991–52008 (2023)

  20. [20]

    Lin, Y.T., Chen, Y.N.: Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. pp. 47–58 (01 2023). https://doi.org/10.18653/v1/2023.nlp4convai-1.5

  21. [21]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Lin, Y., Fung, H., Xu, J., Ren, Z., Lau, A.S., Yin, G., Li, X.: Mvportrait: Text- guided motion and emotion control for multi-view vivid portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26242–26252 (2025)

  22. [22]

    arXiv preprint arXiv:2412.19437 (2024)

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  23. [23]

    Liu, J., Wang, X., Fu, X., Chai, Y., Yu, C., Dai, J., Han, J.: Mfr-net: Multi- faceted responsive listening head generation via denoising diffusion model (2023), https://arxiv.org/abs/2308.16635

  24. [24]

    In: Bouamor, H., Pino, J., Bali, K

    Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG evaluation using gpt-4 with better human alignment. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 2511–2522. Association for Computational Linguistics, Singapore (Dec 2023).https://doi.org/10.18653/v1/2...

  25. [25]

    Advances in neural information processing systems30(2017)

    Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I.: Multi- agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems30(2017)

  26. [26]

    org/abs/2204.08451

    Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion (2022),https://arxiv. org/abs/2204.08451

  27. [27]

    Bokhovkin, S

    Ng, E., Subramanian, S., Klein, D., Kanazawa, A., Darrell, T., Ginosar, S.: Can language models learn to listen? pp. 10049–10059 (10 2023).https://doi.org/ 10.1109/ICCV51070.2023.00925

  28. [28]

    https://doi.org/10.1037/xge0001277

    Oey, L., Schachner, A., Vul, E.: Designing and detecting lies by reasoning about otheragents.JournalofExperimentalPsychology:General152,346–362(082022). https://doi.org/10.1037/xge0001277

  29. [29]

    arXiv preprint arXiv:2411.13543 (2024)

    Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pig- natelli, E., Kuciński, Ł., Pinto, L., Fergus, R., et al.: Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543 (2024)

  30. [30]

    Generative agents: Interactive simulacra of human behavior,

    Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Gener- ative agents: Interactive simulacra of human behavior. UIST ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/ 3586183.3606763,https://doi.org/10.1145/3586183.3606763

  31. [31]

    Premack, D., Woodruff, G.: Does the chimpanzee have a theory of mind? Behav- ioral and brain sciences1(4), 515–526 (1978)

  32. [32]

    arXiv preprint arXiv:2408.15787 (2024)

    Qiu,H.,Lan,Z.:Interactiveagents:Simulatingcounselor-clientpsychologicalcoun- seling via role-playing llm-to-llm interactions. arXiv preprint arXiv:2408.15787 (2024)

  33. [33]

    Hogrefe & Huber Publishers (2002)

    de Raad, B.E., Perugini, M.E.: Big five assessment. Hogrefe & Huber Publishers (2002)

  34. [34]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  35. [35]

    Physics Letters A 338(3), 217–224 (May 2005), ISSN 0375-9601, https://doi.org/10.1016/j

    Ryumina, E., Dresvyanskiy, D., Karpov, A.: In search of a robust facial ex- pressions recognition model: A large-scale visual cross-corpus study. Neurocom- puting514, 435–450 (2022).https://doi.org/https://doi.org/10.1016/j. neucom.2022.10.013,https://www.sciencedirect.com/science/article/pii/ S0925231222012656

  36. [36]

    In: Bouamor, H., Pino, J., Bali, K

    Shao, Y., Li, L., Dai, J., Qiu, X.: Character-LLM: A trainable agent for role- playing. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 13153– 13187. Association for Computational Linguistics, Singapore (Dec 2023),https: //aclanthology.org/2023.emnlp-main.814/

  37. [37]

    Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S.R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., Perez, E.: Towards understanding sycophancy in language models (2025),https: //arxiv.org/abs/2310.13548

  38. [38]

    Shuster, K., Urbanek, J., Szlam, A., Weston, J.: Am i me or you? state-of-the-art dialogue models cannot maintain an identity (2021),https://arxiv.org/abs/ 2112.05843

  39. [39]

    wadsworth Belmont, CA (2009) Resonant Minds 19

    Sternberg, R.J., Sternberg, K., Mio, J.: Cognitive psychology. wadsworth Belmont, CA (2009) Resonant Minds 19

  40. [40]

    Nature human behaviour8(7), 1285–1295 (2024)

    Strachan, J.W., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., Saxena, K., Rufo, A., Panzeri, S., Manzi, G., et al.: Testing theory of mind in large language models and humans. Nature human behaviour8(7), 1285–1295 (2024)

  41. [41]

    In: European Conference on Computer Vision

    Tan, S., Ji, B., Bi, M., Pan, Y.: Edtalk: Efficient disentanglement for emotional talking head synthesis. In: European Conference on Computer Vision. pp. 398–416. Springer (2025)

  42. [42]

    arXiv preprint arXiv:2504.18087 (2025)

    Tan, W., Lin, C., Xu, C., Xu, F., Hu, X., Ji, X., Zhu, J., Wang, C., Fu, Y.: Dis- entangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation. arXiv preprint arXiv:2504.18087 (2025)

  43. [43]

    arXiv preprint arXiv:2312.11805 (2023)

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  44. [44]

    Harvard university press (2009)

    Tomasello, M.: The cultural origins of human cognition. Harvard university press (2009)

  45. [45]

    Belknap Press (2019)

    Tomasello, M.: Becoming human: A theory of ontogeny. Belknap Press (2019)

  46. [46]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Xu, Z., Yu, Z., Zhou, Z., Zhou, J., Jin, X., Hong, F.T., Ji, X., Zhu, J., Cai, C., Tang, S., et al.: Hunyuanportrait: Implicit condition control for enhanced por- trait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15909–15919 (2025)

  47. [47]

    Yadav, N., Achananuparp, P., Jiang, J., Lim, E.P.: Dialtom: A theory of mind benchmark for forecasting state-driven dialogue trajectories (2026),https:// arxiv.org/abs/2604.20443

  48. [48]

    arXiv preprint arXiv:2502.17810 (2025)

    Yan, R., Li, X., Chen, W., Niu, Z., Yang, C., Ma, Z., Yu, K., Chen, X.: Uro- bench: A comprehensive benchmark for end-to-end spoken dialogue models. arXiv preprint arXiv:2502.17810 (2025)

  49. [49]

    (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personal- izing dialogue agents: I have a dog, do you have pets too? In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2204–2213. Asso- ciation for Computational Linguistics, Melbourne, ...

  50. [50]

    Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation (2023),https://arxiv.org/abs/2211.12194

  51. [51]

    Zhang, Z., Jin, C., Jia, M.Y., Zhang, S., Shu, T.: Autotom: Scaling model-based mental inference via automated agent modeling (2026),https://arxiv.org/abs/ 2502.15676

  52. [52]

    Zhao, J., Yang, Q., Peng, Y., Bai, D., Yao, S., Sun, B., Chen, X., Fu, S., chen, W., Wei, X., Bo, L.: Humanomni: A large vision-speech language model for human- centric video understanding (2025),https://arxiv.org/abs/2501.15111

  53. [53]

    In: European conference on computer vision

    Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Responsive listening head generation: a benchmark dataset and baseline. In: European conference on computer vision. pp. 124–142. Springer (2022)

  54. [54]

    Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Vico-x: Multimodal conversation dataset.https://project.mhzhou.com/vico(2022), accessed: 2022- 09-30

  55. [55]

    Shangguan et al

    Zhou, S., Zhou, Y., He, Y., Zhou, X., Wang, J., Deng, W., Shu, J.: Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech (2025),https://arxiv.org/abs/2506.21619 20 J. Shangguan et al

  56. [56]

    In: Al- Onaizan, Y., Bansal, M., Chen, Y.N

    Zhou, X., Su, Z., Eisape, T., Kim, H., Sap, M.: Is this the real life? is this just fantasy? the misleading success of simulating social interactions with LLMs. In: Al- Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 21692–21714. Association for Computational Linguistic...

  57. [57]

    ArXivabs/2310.11667(2023),https://api

    Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., philippe Morency, L., Bisk, Y., Fried, D., Neubig, G., Sap, M.: Sotopia: Interactive evaluation for social intelligence in language agents. ArXivabs/2310.11667(2023),https://api. semanticscholar.org/CorpusID:264289186

  58. [58]

    Showui: One vision-language- action model for GUI visual agent

    Zhu, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: Infp: Audio-driven in- teractive head generation in dyadic conversations. pp. 10667–10677 (06 2025). https://doi.org/10.1109/CVPR52734.2025.00997 Resonant Minds 21 Resonant Minds: Closed-Loop Social A vatars with Theory of Mind Supplementary Material In this supplementary material, we present additi...

  59. [59]

    Are you looking for something specific, or would you like me to show you some options?

    Sally: Hi there! Welcome to the shop. Are you looking for something specific, or would you like me to show you some options?

  60. [60]

    Mostly for writing papers, browsing, and some streaming

    Jack: Hi! I'm looking for a reliable laptop for college. Mostly for writing papers, browsing, and some streaming. Do you have anything good for a student budget?

  61. [61]

    I've fully serviced it myself

    Sally: Absolutely! This Dell Inspiron has a fast processor, plenty of RAM, and excellent battery life. I've fully serviced it myself. I'm asking $675

  62. [62]

    Is there any flexibility on the price?

    Jack: I appreciate you showing me, but $675 is a bit higher than I was hoping. Is there any flexibility on the price?

  63. [63]

    I could come down to $650 for you

    Sally: I understand, college budgets can be tough. I could come down to $650 for you. That's a great deal considering the specs and warranty

  64. [64]

    Would you be able to go any lower?

    Jack: I appreciate that, but $650 is still a bit out of reach for me. Would you be able to go any lower?

  65. [65]

    The best I can do on this Dell is $6 25

    Sally: I understand where you're coming from. The best I can do on this Dell is $6 25. That's really competitive for what you're getting

  66. [66]

    Do you have anything more budget-friendly?

    Jack: I appreciate you working with me, but $625 is still a bit of a stretch. Do you have anything more budget-friendly?

  67. [67]

    I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5

    Sally: I understand. I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5. The ThinkPad is older but solid for college work

  68. [68]

    The ThinkPad at $445 is closer to my range

    Jack: I appreciate you showing me those. The ThinkPad at $445 is closer to my range. Would you be able to do any better on the price?

  69. [69]

    I could do $430 for you on the ThinkPad

    Sally: I understand. I could do $430 for you on the ThinkPad. I've recently replaced the battery, so it still holds a good charge

  70. [70]

    Is there any chance you could go closer to $400? I'd take it today

    Jack: I appreciate that, but $430 is still a bit of a stretch for me. Is there any chance you could go closer to $400? I'd take it today. Fig.D.1:Dialogue example inAgentmode (two independent LLM agents, no social reasoning). 34 J. Shangguan et al. Script Mode Two strangers meet at a second -hand electronics store. One is the store owner; the other is a c...

  71. [71]

    Sally: Hi there, looking for anything in particular today?

  72. [72]

    Do you have any good options?

    Jack: Yeah, I need a laptop for college, but I'm on a bit of a budget. Do you have any good options?

  73. [73]

    I'm asking $550 for it

    Sally: Absolutely, I have this model right here — great condition, reliable battery. I'm asking $550 for it

  74. [74]

    It looks good, but $550 is a bit high for me

    Jack: I see. It looks good, but $550 is a bit high for me. My budget's really capped at $500

  75. [75]

    How about $525? That's a great deal for what you're getting

    Sally: I understand, but this laptop is one of our best. How about $525? That's a great deal for what you're getting

  76. [76]

    If you can do $500, I'd be ready to buy it right now

    Jack: I really can't go above $500. If you can do $500, I'd be ready to buy it right now

  77. [77]

    I'll ring you up at the counter

    Sally: Alright, $500 it is. I'll ring you up at the counter. Fig.D.2:Dialogue example inScriptmode (single omniscient LLM generates both sides). Resonant Minds 35 Our Method Two strangers meet at a second -hand electronics store. One is the store owner; the other is a college student looking to buy a laptop on a tight budget. Sally Chen She/her · 34 Secon...

  78. [78]

    great options fo r students

    Sally: Hey there! Looking for a laptop today? I've got some great options for students. Perception (Ja ck ← Sally) Spee ch: Friendly gree ting, "great options fo r students." Emotion: Cheerful, welcoming Expr: Open posture, warm smile, gestured into shop Perception (Ja ck ← Sally) B: Confident seller; will re commend expensive items first. D: Eager to mak...

  79. [79]

    I nee d a laptop. What's unde r $4 00?

    Jack: C1: "I nee d a laptop. What's unde r $4 00?"[E: 5 S :3 C:6 → 4.55] ▸ C2: "I nee d something for classes — writing, browsing, video calls. What's available?"[E: 8 S :8 C:9 → 8.25]✓ C3: "Hey, show me your che apest laptops."[E: 3 S :4 C:5 → 3.85] Perception (Sa lly ← Jack) Spee ch: "...What do you have in my... I mean, what's available?" Emotion: Slig...

  80. [80]

    I have a T hinkPad at $46 0. Basic but works

    Sally C1: "I have a T hinkPad at $46 0. Basic but works."[E: 6 S :4 C:7 → 5.55] C2: "Everything starts at $60 0+. Let me know."[E: 4 S :7 C:5 → 5.30] ▸ C3: "This Dell — solid specs, 3-month warranty. $58 0."[E: 8 S :9 C:8 → 8.35]✓ Perception (Ja ck ← Sally) Spee ch: "This Dell — $58 0, fully checked, 3-month warranty." Emotion: Confident, professional Exp...

Showing first 80 references.