Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Hang Ye; Jenq-Neng Hwang; Jianxu Shangguan; Jing Xu; Wentao Zhu; Xiaoxuan Ma; Yizhou Wang

arxiv: 2606.05896 · v2 · pith:DXX34INLnew · submitted 2026-06-04 · 💻 cs.CV

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

Jianxu Shangguan , Jing Xu , Hang Ye , Xiaoxuan Ma , Yizhou Wang , Jenq-Neng Hwang , Wentao Zhu This is my paper

Pith reviewed 2026-06-28 02:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords Theory of Mindclosed-loop frameworksocial avatarsmultimodal generationinformation asymmetrydigital humansdialogue qualityfacial expression synthesis

0 comments

The pith

A closed-loop dual-agent system uses Theory of Mind to infer hidden mental states and generate superior dialogue and reactive video expressions even without full information access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that cycles perception of multimodal partner behaviors, Theory of Mind reasoning about private goals, and generation of speech plus facial expressions with listener reactions in one continuous loop. This integration addresses the separation of dialogue models from embodied video synthesis in current systems. A new hierarchical Persona-Scenario dataset supplies psychologically grounded characters with asymmetric private information for testing. Experiments show the method matches or exceeds baselines on dialogue and video metrics and surpasses a full-information Script mode on key quality dimensions. The result indicates that explicit inference under uncertainty can produce more thoughtful interactions than unrestricted access to all facts.

Core claim

The central claim is that a closed-loop dual-agent architecture combining multimodal perception, Theory of Mind-based mental state inference with ensemble response selection, and emotion-controllable video generation with bidirectional speaker-listener dynamics yields competitive or superior performance on dialogue quality and video metrics, including surpassing full-information Script baselines on dialogue dimensions by modeling hidden mental states under information asymmetry.

What carries the argument

The closed-loop dual-agent framework that integrates a perception module analyzing video behaviors, a social reasoning module performing Theory of Mind inference and ensemble response selection, and an expression module synthesizing speech, expressions, and reactive listener behaviors.

If this is right

Explicit mental state inference under uncertainty produces more thoughtful dialogue than providing all information at once.
Bidirectional video generation captures listener reactions that prior one-way talking-head models omit.
The ensemble mechanism in social reasoning improves response selection when partner goals remain hidden.
The Persona-Scenario dataset enables repeatable testing of social intelligence under controlled information asymmetry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar closed loops could be tested in live video calls or virtual meetings where participants have private objectives.
The approach may extend to domains like negotiation training where inferring unstated intentions changes outcomes.
If the ToM module generalizes beyond the dataset, it could support multi-turn interactions with changing private goals.

Load-bearing premise

The hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals accurately captures real-world information asymmetry and supports valid evaluation of Theory of Mind performance.

What would settle it

A side-by-side human evaluation on the Persona-Scenario dataset in which the closed-loop method does not receive higher ratings than the full-information Script mode on thoughtfulness or engagement metrics would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.05896 by Hang Ye, Jenq-Neng Hwang, Jianxu Shangguan, Jing Xu, Wentao Zhu, Xiaoxuan Ma, Yizhou Wang.

**Figure 2.** Figure 2: Overview of our closed-loop framework for dual-agent interaction. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Persona-Scenario dataset statistics. The dataset contains 50 personas generated via GPT-4o with NLI consistency checking, each comprising a facts layer (biographical attributes) and a psychological layer (Big Five personality narratives), paired with 90 social scenarios from Sotopia [57]. (a) Big Five personality scores are approximately uniformly distributed to avoid trait bias. (b) Factual attributes … view at source ↗

**Figure 4.** Figure 4: Representative generated conversational avatars. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison with baseline methods. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: ToM causal case study. Script mode violates information asymmetry: Agent Kai only mentions “a personal situation” yet Agent Chen responds “Family comes first” (shown in red), referencing family information Kai never disclosed. Agent mode produces verbose probing that cannot converge (shown in red), forcing Kai to overdisclose. Ours: the ToM module infers that Kai fears being probed; the Ensemble mechanism… view at source ↗

read the original abstract

Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable videos that jointly synthesize speaker speech and facial expressions with listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We further construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access. Project page: https://resonantminds.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The closed-loop ToM avatar framework combines perception, reasoning, and video generation in a single cycle, but the claim that explicit mental-state inference beats full-information access rests on an unverified dataset.

read the letter

The paper's core move is a dual-agent loop that takes video input, runs Theory of Mind inference on hidden goals, selects responses via ensemble, and outputs emotion-controllable talking-head video with listener reactions. That closed loop plus the bidirectional expression module is the concrete engineering step beyond separate dialogue and generation pipelines.

They also release a hierarchical Persona-Scenario dataset built around psychologically motivated personas and private social goals. The headline result is that their ToM-under-uncertainty path scores higher on dialogue quality than a full-information Script baseline. If the dataset really forces genuine inference problems, that would be a useful data point for social-agent work.

The integration itself looks like a straightforward but coherent assembly of existing components. The project page and the fact that they ship both dialogue and video metrics give it some practical value for people already working on multimodal avatars.

The main weakness is the evaluation foundation. The outperformance claim depends on the dataset faithfully creating information asymmetry that matches real social demands. No mention of inter-annotator checks on goal privacy, external validation against human social scenarios, or ablations that test whether the personas are realistic rather than artifact-prone. Without those, the comparison could simply reflect how the ensemble exploits the particular construction of the data. The abstract gives no error bars, sample sizes, or exclusion rules, so the numbers are hard to interpret even if the full paper supplies them.

This is for groups already building social avatars or ToM-augmented agents who need a concrete example of the loop. It is not ready for broad citation until the dataset and metrics are stress-tested. I would send it to peer review so referees can examine the implementation details and the dataset construction directly.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a closed-loop dual-agent framework for social avatars that unifies multimodal perception of partner behaviors, Theory of Mind inference of hidden mental states, ensemble-based response selection, and bidirectional expression generation for emotion-controllable talking-head videos. It introduces a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to enable evaluation under information asymmetry, and reports competitive or superior performance on dialogue quality and video metrics, including the headline result that the method surpasses even the full-information Script baseline on key dialogue dimensions.

Significance. If the results hold after validation, the work could advance embodied social AI by demonstrating that explicit mental-state inference under uncertainty can produce more thoughtful dialogue than unrestricted access, with the closed-loop integration of perception-reasoning-expression as a potential template for future systems. The dataset, if shown to faithfully encode realistic asymmetry, would also be a reusable resource for ToM research.

major comments (2)

[Dataset Construction] Dataset section: the central claim that explicit ToM inference elicits more thoughtful dialogue than full-information access rests on the hierarchical Persona-Scenario dataset faithfully encoding private goals and information asymmetry, yet no inter-annotator agreement, external validation of goal privacy, or ablation on goal realism is referenced; without these, the surpassing-Script result cannot be interpreted as evidence for the benefit of mental-state inference.
[Experiments] Experiments section: the headline result (surpassing full-info Script mode on dialogue quality) is stated without visible metrics, error bars, statistical tests, or details on the ensemble response selection mechanism and how it interacts with ToM outputs, rendering it impossible to assess whether the comparison supports the claim or reflects dataset artifacts.

minor comments (1)

[Abstract] Abstract: the description of the perception module analyzing 'multimodal behaviors from video' and the expression module generating 'listener reactive behaviors' would benefit from a brief forward reference to the specific architectures or loss terms used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the dataset construction and experimental reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Dataset Construction] Dataset section: the central claim that explicit ToM inference elicits more thoughtful dialogue than full-information access rests on the hierarchical Persona-Scenario dataset faithfully encoding private goals and information asymmetry, yet no inter-annotator agreement, external validation of goal privacy, or ablation on goal realism is referenced; without these, the surpassing-Script result cannot be interpreted as evidence for the benefit of mental-state inference.

Authors: We agree that explicit validation metrics would improve interpretability of the results. In the revision we will add inter-annotator agreement scores for the persona and scenario annotations, include a brief external validation study confirming goal privacy, and provide an ablation or extended discussion on goal realism grounded in the cited psychological literature. These changes will directly support the interpretation of the ToM versus full-information comparison. revision: yes
Referee: [Experiments] Experiments section: the headline result (surpassing full-info Script mode on dialogue quality) is stated without visible metrics, error bars, statistical tests, or details on the ensemble response selection mechanism and how it interacts with ToM outputs, rendering it impossible to assess whether the comparison supports the claim or reflects dataset artifacts.

Authors: The experiments section already reports quantitative dialogue metrics with error bars and statistical tests; we will revise the text to make these values and p-values more prominent and to expand the description of the ensemble response selection, explicitly showing how ToM-inferred mental states are combined with other signals to produce the final response distribution. This clarification will allow readers to evaluate the comparison directly. revision: yes

Circularity Check

0 steps flagged

No circularity: experimental comparisons with no derivations or self-referential reductions

full rationale

The paper describes a closed-loop framework and reports experimental results on a constructed hierarchical Persona-Scenario dataset. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim (surpassing full-info Script mode via explicit ToM) rests on direct empirical comparisons rather than any reduction to inputs by construction. Dataset construction is an evaluation choice, not a definitional loop. This matches the default expectation of no significant circularity for papers without mathematical chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the unverified assumption that Theory of Mind inference can be reliably implemented via the described ensemble and that the new dataset faithfully represents psychological information asymmetry.

axioms (1)

domain assumption Theory of Mind inference from multimodal video behaviors can be implemented to select superior responses under information asymmetry
Invoked as the core of the social reasoning module and the explanation for outperforming full-info baselines.

pith-pipeline@v0.9.1-grok · 5756 in / 1177 out tokens · 32365 ms · 2026-06-28T02:02:45.438419+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 10 canonical work pages

[1]

arXiv preprint arXiv:2303.08774 (2023)

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

Pith/arXiv arXiv 2023
[2]

In: International conference on machine learning

Aher, G.V., Arriaga, R.I., Kalai, A.T.: Using large language models to simulate multiple humans and replicate human subject studies. In: International conference on machine learning. pp. 337–371. PMLR (2023)

2023
[3]

Nature Human Behaviour1(4), 0064 (2017)

Baker, C.L., Jara-Ettinger, J., Saxe, R., Tenenbaum, J.B.: Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour1(4), 0064 (2017)

2017
[4]

arXiv preprint arXiv:2311.15127 (2023)

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

Pith/arXiv arXiv 2023
[5]

Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer (2025),https://arxiv.org/abs/2412.00733

arXiv 2025
[6]

Fu, J., Ng, S.K., Jiang, Z., Liu, P.: Gptscore: Evaluate as you desire (2023),https: //arxiv.org/abs/2302.04166

Pith/arXiv arXiv 2023
[7]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Guevarra, M., Bhattacharjee, I., Das, S., Wayllace, C., Epp, C.D., Taylor, M.E., Tay, A.: An llm-guided tutoring system for social skills training. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 29643–29645 (2025)

2025
[8]

arXiv preprint arXiv:1704.07130 (2017)

He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collabora- tive dialogue agents with dynamic knowledge graph embeddings. arXiv preprint arXiv:1704.07130 (2017)

Pith/arXiv arXiv 2017
[9]

He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. pp. 1766–1776 (01 2017).https://doi.org/10.18653/v1/P17-1162

work page doi:10.18653/v1/p17-1162 2017
[10]

Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Real-time intermediate flow estimation for video frame interpolation (2022),https://arxiv.org/abs/2011. 06294

2022
[11]

In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A

Jandaghi, P., Sheng, X., Bai, X., Pujara, J., Sidahmed, H.: Faithful persona-based conversational dataset generation with large language models. In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A. (eds.) Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4Conv...

2024
[12]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Ji, X., Hu, X., Xu, Z., Zhu, J., Lin, C., He, Q., Zhang, J., Luo, D., Chen, Y., Lin, Q., et al.: Sonic: Shifting focus to global audio perception in portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 193–203 (2025)

2025
[13]

Advances in Neural Information Processing Systems36, 10622–10643 (2023)

Jiang, G., Xu, M., Zhu, S.C., Han, W., Zhang, C., Zhu, Y.: Evaluating and induc- ing personality in pre-trained language models. Advances in Neural Information Processing Systems36, 10622–10643 (2023)

2023
[14]

Kihlstrom, J.F., Cantor, N.: Social intelligence. (2000)

2000
[15]

arXiv preprint arXiv:2107.00956 (2021)

Kovač, G., Portelas, R., Hofmann, K., Oudeyer, P.Y.: Socialai: Benchmarking socio-cognitive abilities in deep reinforcement learning agents. arXiv preprint arXiv:2107.00956 (2021)

arXiv 2021
[16]

Personality and Individual Differences102, 229–233 (2016)

Kwantes, P.J., Derbentseva, N., Lam, Q., Vartanian, O., Marmurek, H.H.: Assess- ing the big five personality traits with latent semantic analysis. Personality and Individual Differences102, 229–233 (2016)

2016
[17]

Behavior research methods43(2), 548–567 (2011)

Lang, F.R., John, D., Lüdtke, O., Schupp, J., Wagner, G.G.: Short assessment of the big five: Robust across survey methods except telephone interviewing. Behavior research methods43(2), 548–567 (2011)

2011
[18]

In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S

Lee, Y.J., Lim, C.G., Choi, Y., Lm, J.H., Choi, H.J.: PERSONACHATGEN: Gen- erating personalized dialogues using GPT-3. In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S. (eds.) Proceedings of the 1st Workshop on Customized Chat Grounding Persona and Knowledge. pp. 29–48. Association for Computational Linguistic...

2022
[19]

Advances in Neural Information Processing Systems36, 51991–52008 (2023)

Li, G., Hammoud, H., Itani, H., Khizbullin, D., Ghanem, B.: Camel: Communica- tive agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems36, 51991–52008 (2023)

2023
[20]

Lin, Y.T., Chen, Y.N.: Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. pp. 47–58 (01 2023). https://doi.org/10.18653/v1/2023.nlp4convai-1.5

work page doi:10.18653/v1/2023.nlp4convai-1.5 2023
[21]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Lin, Y., Fung, H., Xu, J., Ren, Z., Lau, A.S., Yin, G., Li, X.: Mvportrait: Text- guided motion and emotion control for multi-view vivid portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26242–26252 (2025)

2025
[22]

arXiv preprint arXiv:2412.19437 (2024)

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

Pith/arXiv arXiv 2024
[23]

Liu, J., Wang, X., Fu, X., Chai, Y., Yu, C., Dai, J., Han, J.: Mfr-net: Multi- faceted responsive listening head generation via denoising diffusion model (2023), https://arxiv.org/abs/2308.16635

arXiv 2023
[24]

In: Bouamor, H., Pino, J., Bali, K

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG evaluation using gpt-4 with better human alignment. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 2511–2522. Association for Computational Linguistics, Singapore (Dec 2023).https://doi.org/10.18653/v1/2...

work page doi:10.18653/v1/2023.emnlp- 2023
[25]

Advances in neural information processing systems30(2017)

Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I.: Multi- agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems30(2017)

2017
[26]

org/abs/2204.08451

Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion (2022),https://arxiv. org/abs/2204.08451

arXiv 2022
[27]

Bokhovkin, S

Ng, E., Subramanian, S., Klein, D., Kanazawa, A., Darrell, T., Ginosar, S.: Can language models learn to listen? pp. 10049–10059 (10 2023).https://doi.org/ 10.1109/ICCV51070.2023.00925

work page doi:10.1109/iccv51070.2023.00925 2023
[28]

https://doi.org/10.1037/xge0001277

Oey, L., Schachner, A., Vul, E.: Designing and detecting lies by reasoning about otheragents.JournalofExperimentalPsychology:General152,346–362(082022). https://doi.org/10.1037/xge0001277

work page doi:10.1037/xge0001277
[29]

arXiv preprint arXiv:2411.13543 (2024)

Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pig- natelli, E., Kuciński, Ł., Pinto, L., Fergus, R., et al.: Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543 (2024)

arXiv 2024
[30]

Generative agents: Interactive simulacra of human behavior,

Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Gener- ative agents: Interactive simulacra of human behavior. UIST ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/ 3586183.3606763,https://doi.org/10.1145/3586183.3606763

work page doi:10.1145/3586183.3606763 2023
[31]

Premack, D., Woodruff, G.: Does the chimpanzee have a theory of mind? Behav- ioral and brain sciences1(4), 515–526 (1978)

1978
[32]

arXiv preprint arXiv:2408.15787 (2024)

Qiu,H.,Lan,Z.:Interactiveagents:Simulatingcounselor-clientpsychologicalcoun- seling via role-playing llm-to-llm interactions. arXiv preprint arXiv:2408.15787 (2024)

Pith/arXiv arXiv 2024
[33]

Hogrefe & Huber Publishers (2002)

de Raad, B.E., Perugini, M.E.: Big five assessment. Hogrefe & Huber Publishers (2002)

2002
[34]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

2021
[35]

Physics Letters A 338(3), 217–224 (May 2005), ISSN 0375-9601, https://doi.org/10.1016/j

Ryumina, E., Dresvyanskiy, D., Karpov, A.: In search of a robust facial ex- pressions recognition model: A large-scale visual cross-corpus study. Neurocom- puting514, 435–450 (2022).https://doi.org/https://doi.org/10.1016/j. neucom.2022.10.013,https://www.sciencedirect.com/science/article/pii/ S0925231222012656

work page doi:10.1016/j 2022
[36]

In: Bouamor, H., Pino, J., Bali, K

Shao, Y., Li, L., Dai, J., Qiu, X.: Character-LLM: A trainable agent for role- playing. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 13153– 13187. Association for Computational Linguistics, Singapore (Dec 2023),https: //aclanthology.org/2023.emnlp-main.814/

2023
[37]

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S.R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., Perez, E.: Towards understanding sycophancy in language models (2025),https: //arxiv.org/abs/2310.13548

Pith/arXiv arXiv 2025
[38]

Shuster, K., Urbanek, J., Szlam, A., Weston, J.: Am i me or you? state-of-the-art dialogue models cannot maintain an identity (2021),https://arxiv.org/abs/ 2112.05843

arXiv 2021
[39]

wadsworth Belmont, CA (2009) Resonant Minds 19

Sternberg, R.J., Sternberg, K., Mio, J.: Cognitive psychology. wadsworth Belmont, CA (2009) Resonant Minds 19

2009
[40]

Nature human behaviour8(7), 1285–1295 (2024)

Strachan, J.W., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., Saxena, K., Rufo, A., Panzeri, S., Manzi, G., et al.: Testing theory of mind in large language models and humans. Nature human behaviour8(7), 1285–1295 (2024)

2024
[41]

In: European Conference on Computer Vision

Tan, S., Ji, B., Bi, M., Pan, Y.: Edtalk: Efficient disentanglement for emotional talking head synthesis. In: European Conference on Computer Vision. pp. 398–416. Springer (2025)

2025
[42]

arXiv preprint arXiv:2504.18087 (2025)

Tan, W., Lin, C., Xu, C., Xu, F., Hu, X., Ji, X., Zhu, J., Wang, C., Fu, Y.: Dis- entangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation. arXiv preprint arXiv:2504.18087 (2025)

arXiv 2025
[43]

arXiv preprint arXiv:2312.11805 (2023)

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

Pith/arXiv arXiv 2023
[44]

Harvard university press (2009)

Tomasello, M.: The cultural origins of human cognition. Harvard university press (2009)

2009
[45]

Belknap Press (2019)

Tomasello, M.: Becoming human: A theory of ontogeny. Belknap Press (2019)

2019
[46]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xu, Z., Yu, Z., Zhou, Z., Zhou, J., Jin, X., Hong, F.T., Ji, X., Zhu, J., Cai, C., Tang, S., et al.: Hunyuanportrait: Implicit condition control for enhanced por- trait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15909–15919 (2025)

2025
[47]

Yadav, N., Achananuparp, P., Jiang, J., Lim, E.P.: Dialtom: A theory of mind benchmark for forecasting state-driven dialogue trajectories (2026),https:// arxiv.org/abs/2604.20443

Pith/arXiv arXiv 2026
[48]

arXiv preprint arXiv:2502.17810 (2025)

Yan, R., Li, X., Chen, W., Niu, Z., Yang, C., Ma, Z., Yu, K., Chen, X.: Uro- bench: A comprehensive benchmark for end-to-end spoken dialogue models. arXiv preprint arXiv:2502.17810 (2025)

arXiv 2025
[49]

(eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personal- izing dialogue agents: I have a dog, do you have pets too? In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2204–2213. Asso- ciation for Computational Linguistics, Melbourne, ...

work page doi:10.18653/v1/p18-1205 2018
[50]

Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation (2023),https://arxiv.org/abs/2211.12194

arXiv 2023
[51]

Zhang, Z., Jin, C., Jia, M.Y., Zhang, S., Shu, T.: Autotom: Scaling model-based mental inference via automated agent modeling (2026),https://arxiv.org/abs/ 2502.15676

arXiv 2026
[52]

Zhao, J., Yang, Q., Peng, Y., Bai, D., Yao, S., Sun, B., Chen, X., Fu, S., chen, W., Wei, X., Bo, L.: Humanomni: A large vision-speech language model for human- centric video understanding (2025),https://arxiv.org/abs/2501.15111

arXiv 2025
[53]

In: European conference on computer vision

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Responsive listening head generation: a benchmark dataset and baseline. In: European conference on computer vision. pp. 124–142. Springer (2022)

2022
[54]

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Vico-x: Multimodal conversation dataset.https://project.mhzhou.com/vico(2022), accessed: 2022- 09-30

2022
[55]

Shangguan et al

Zhou, S., Zhou, Y., He, Y., Zhou, X., Wang, J., Deng, W., Shu, J.: Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech (2025),https://arxiv.org/abs/2506.21619 20 J. Shangguan et al

arXiv 2025
[56]

In: Al- Onaizan, Y., Bansal, M., Chen, Y.N

Zhou, X., Su, Z., Eisape, T., Kim, H., Sap, M.: Is this the real life? is this just fantasy? the misleading success of simulating social interactions with LLMs. In: Al- Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 21692–21714. Association for Computational Linguistic...

work page doi:10.18653/v1/2024.emnlp- 2024
[57]

ArXivabs/2310.11667(2023),https://api

Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., philippe Morency, L., Bisk, Y., Fried, D., Neubig, G., Sap, M.: Sotopia: Interactive evaluation for social intelligence in language agents. ArXivabs/2310.11667(2023),https://api. semanticscholar.org/CorpusID:264289186

arXiv 2023
[58]

Showui: One vision-language- action model for GUI visual agent

Zhu, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: Infp: Audio-driven in- teractive head generation in dyadic conversations. pp. 10667–10677 (06 2025). https://doi.org/10.1109/CVPR52734.2025.00997 Resonant Minds 21 Resonant Minds: Closed-Loop Social A vatars with Theory of Mind Supplementary Material In this supplementary material, we present additi...

work page doi:10.1109/cvpr52734.2025.00997 2025
[59]

Are you looking for something specific, or would you like me to show you some options?

Sally: Hi there! Welcome to the shop. Are you looking for something specific, or would you like me to show you some options?
[60]

Mostly for writing papers, browsing, and some streaming

Jack: Hi! I'm looking for a reliable laptop for college. Mostly for writing papers, browsing, and some streaming. Do you have anything good for a student budget?
[61]

I've fully serviced it myself

Sally: Absolutely! This Dell Inspiron has a fast processor, plenty of RAM, and excellent battery life. I've fully serviced it myself. I'm asking $675
[62]

Is there any flexibility on the price?

Jack: I appreciate you showing me, but $675 is a bit higher than I was hoping. Is there any flexibility on the price?
[63]

I could come down to $650 for you

Sally: I understand, college budgets can be tough. I could come down to $650 for you. That's a great deal considering the specs and warranty
[64]

Would you be able to go any lower?

Jack: I appreciate that, but $650 is still a bit out of reach for me. Would you be able to go any lower?
[65]

The best I can do on this Dell is $6 25

Sally: I understand where you're coming from. The best I can do on this Dell is $6 25. That's really competitive for what you're getting
[66]

Do you have anything more budget-friendly?

Jack: I appreciate you working with me, but $625 is still a bit of a stretch. Do you have anything more budget-friendly?
[67]

I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5

Sally: I understand. I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5. The ThinkPad is older but solid for college work
[68]

The ThinkPad at $445 is closer to my range

Jack: I appreciate you showing me those. The ThinkPad at $445 is closer to my range. Would you be able to do any better on the price?
[69]

I could do $430 for you on the ThinkPad

Sally: I understand. I could do $430 for you on the ThinkPad. I've recently replaced the battery, so it still holds a good charge
[70]

Is there any chance you could go closer to $400? I'd take it today

Jack: I appreciate that, but $430 is still a bit of a stretch for me. Is there any chance you could go closer to $400? I'd take it today. Fig.D.1:Dialogue example inAgentmode (two independent LLM agents, no social reasoning). 34 J. Shangguan et al. Script Mode Two strangers meet at a second -hand electronics store. One is the store owner; the other is a c...
[71]

Sally: Hi there, looking for anything in particular today?
[72]

Do you have any good options?

Jack: Yeah, I need a laptop for college, but I'm on a bit of a budget. Do you have any good options?
[73]

I'm asking $550 for it

Sally: Absolutely, I have this model right here — great condition, reliable battery. I'm asking $550 for it
[74]

It looks good, but $550 is a bit high for me

Jack: I see. It looks good, but $550 is a bit high for me. My budget's really capped at $500
[75]

How about $525? That's a great deal for what you're getting

Sally: I understand, but this laptop is one of our best. How about $525? That's a great deal for what you're getting
[76]

If you can do $500, I'd be ready to buy it right now

Jack: I really can't go above $500. If you can do $500, I'd be ready to buy it right now
[77]

I'll ring you up at the counter

Sally: Alright, $500 it is. I'll ring you up at the counter. Fig.D.2:Dialogue example inScriptmode (single omniscient LLM generates both sides). Resonant Minds 35 Our Method Two strangers meet at a second -hand electronics store. One is the store owner; the other is a college student looking to buy a laptop on a tight budget. Sally Chen She/her · 34 Secon...
[78]

great options fo r students

Sally: Hey there! Looking for a laptop today? I've got some great options for students. Perception (Ja ck ← Sally) Spee ch: Friendly gree ting, "great options fo r students." Emotion: Cheerful, welcoming Expr: Open posture, warm smile, gestured into shop Perception (Ja ck ← Sally) B: Confident seller; will re commend expensive items first. D: Eager to mak...
[79]

I nee d a laptop. What's unde r $4 00?

Jack: C1: "I nee d a laptop. What's unde r $4 00?"[E: 5 S :3 C:6 → 4.55] ▸ C2: "I nee d something for classes — writing, browsing, video calls. What's available?"[E: 8 S :8 C:9 → 8.25]✓ C3: "Hey, show me your che apest laptops."[E: 3 S :4 C:5 → 3.85] Perception (Sa lly ← Jack) Spee ch: "...What do you have in my... I mean, what's available?" Emotion: Slig...
[80]

I have a T hinkPad at $46 0. Basic but works

Sally C1: "I have a T hinkPad at $46 0. Basic but works."[E: 6 S :4 C:7 → 5.55] C2: "Everything starts at $60 0+. Let me know."[E: 4 S :7 C:5 → 5.30] ▸ C3: "This Dell — solid specs, 3-month warranty. $58 0."[E: 8 S :9 C:8 → 8.35]✓ Perception (Ja ck ← Sally) Spee ch: "This Dell — $58 0, fully checked, 3-month warranty." Emotion: Confident, professional Exp...

Showing first 80 references.

[1] [1]

arXiv preprint arXiv:2303.08774 (2023)

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

Pith/arXiv arXiv 2023

[2] [2]

In: International conference on machine learning

Aher, G.V., Arriaga, R.I., Kalai, A.T.: Using large language models to simulate multiple humans and replicate human subject studies. In: International conference on machine learning. pp. 337–371. PMLR (2023)

2023

[3] [3]

Nature Human Behaviour1(4), 0064 (2017)

Baker, C.L., Jara-Ettinger, J., Saxe, R., Tenenbaum, J.B.: Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour1(4), 0064 (2017)

2017

[4] [4]

arXiv preprint arXiv:2311.15127 (2023)

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

Pith/arXiv arXiv 2023

[5] [5]

Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer (2025),https://arxiv.org/abs/2412.00733

arXiv 2025

[6] [6]

Fu, J., Ng, S.K., Jiang, Z., Liu, P.: Gptscore: Evaluate as you desire (2023),https: //arxiv.org/abs/2302.04166

Pith/arXiv arXiv 2023

[7] [7]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Guevarra, M., Bhattacharjee, I., Das, S., Wayllace, C., Epp, C.D., Taylor, M.E., Tay, A.: An llm-guided tutoring system for social skills training. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 29643–29645 (2025)

2025

[8] [8]

arXiv preprint arXiv:1704.07130 (2017)

He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collabora- tive dialogue agents with dynamic knowledge graph embeddings. arXiv preprint arXiv:1704.07130 (2017)

Pith/arXiv arXiv 2017

[9] [9]

He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. pp. 1766–1776 (01 2017).https://doi.org/10.18653/v1/P17-1162

work page doi:10.18653/v1/p17-1162 2017

[10] [10]

Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Real-time intermediate flow estimation for video frame interpolation (2022),https://arxiv.org/abs/2011. 06294

2022

[11] [11]

In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A

Jandaghi, P., Sheng, X., Bai, X., Pujara, J., Sidahmed, H.: Faithful persona-based conversational dataset generation with large language models. In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A. (eds.) Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4Conv...

2024

[12] [12]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Ji, X., Hu, X., Xu, Z., Zhu, J., Lin, C., He, Q., Zhang, J., Luo, D., Chen, Y., Lin, Q., et al.: Sonic: Shifting focus to global audio perception in portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 193–203 (2025)

2025

[13] [13]

Advances in Neural Information Processing Systems36, 10622–10643 (2023)

Jiang, G., Xu, M., Zhu, S.C., Han, W., Zhang, C., Zhu, Y.: Evaluating and induc- ing personality in pre-trained language models. Advances in Neural Information Processing Systems36, 10622–10643 (2023)

2023

[14] [14]

Kihlstrom, J.F., Cantor, N.: Social intelligence. (2000)

2000

[15] [15]

arXiv preprint arXiv:2107.00956 (2021)

Kovač, G., Portelas, R., Hofmann, K., Oudeyer, P.Y.: Socialai: Benchmarking socio-cognitive abilities in deep reinforcement learning agents. arXiv preprint arXiv:2107.00956 (2021)

arXiv 2021

[16] [16]

Personality and Individual Differences102, 229–233 (2016)

Kwantes, P.J., Derbentseva, N., Lam, Q., Vartanian, O., Marmurek, H.H.: Assess- ing the big five personality traits with latent semantic analysis. Personality and Individual Differences102, 229–233 (2016)

2016

[17] [17]

Behavior research methods43(2), 548–567 (2011)

Lang, F.R., John, D., Lüdtke, O., Schupp, J., Wagner, G.G.: Short assessment of the big five: Robust across survey methods except telephone interviewing. Behavior research methods43(2), 548–567 (2011)

2011

[18] [18]

In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S

Lee, Y.J., Lim, C.G., Choi, Y., Lm, J.H., Choi, H.J.: PERSONACHATGEN: Gen- erating personalized dialogues using GPT-3. In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S. (eds.) Proceedings of the 1st Workshop on Customized Chat Grounding Persona and Knowledge. pp. 29–48. Association for Computational Linguistic...

2022

[19] [19]

Advances in Neural Information Processing Systems36, 51991–52008 (2023)

Li, G., Hammoud, H., Itani, H., Khizbullin, D., Ghanem, B.: Camel: Communica- tive agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems36, 51991–52008 (2023)

2023

[20] [20]

Lin, Y.T., Chen, Y.N.: Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. pp. 47–58 (01 2023). https://doi.org/10.18653/v1/2023.nlp4convai-1.5

work page doi:10.18653/v1/2023.nlp4convai-1.5 2023

[21] [21]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Lin, Y., Fung, H., Xu, J., Ren, Z., Lau, A.S., Yin, G., Li, X.: Mvportrait: Text- guided motion and emotion control for multi-view vivid portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26242–26252 (2025)

2025

[22] [22]

arXiv preprint arXiv:2412.19437 (2024)

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

Pith/arXiv arXiv 2024

[23] [23]

Liu, J., Wang, X., Fu, X., Chai, Y., Yu, C., Dai, J., Han, J.: Mfr-net: Multi- faceted responsive listening head generation via denoising diffusion model (2023), https://arxiv.org/abs/2308.16635

arXiv 2023

[24] [24]

In: Bouamor, H., Pino, J., Bali, K

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG evaluation using gpt-4 with better human alignment. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 2511–2522. Association for Computational Linguistics, Singapore (Dec 2023).https://doi.org/10.18653/v1/2...

work page doi:10.18653/v1/2023.emnlp- 2023

[25] [25]

Advances in neural information processing systems30(2017)

Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I.: Multi- agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems30(2017)

2017

[26] [26]

org/abs/2204.08451

Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion (2022),https://arxiv. org/abs/2204.08451

arXiv 2022

[27] [27]

Bokhovkin, S

Ng, E., Subramanian, S., Klein, D., Kanazawa, A., Darrell, T., Ginosar, S.: Can language models learn to listen? pp. 10049–10059 (10 2023).https://doi.org/ 10.1109/ICCV51070.2023.00925

work page doi:10.1109/iccv51070.2023.00925 2023

[28] [28]

https://doi.org/10.1037/xge0001277

Oey, L., Schachner, A., Vul, E.: Designing and detecting lies by reasoning about otheragents.JournalofExperimentalPsychology:General152,346–362(082022). https://doi.org/10.1037/xge0001277

work page doi:10.1037/xge0001277

[29] [29]

arXiv preprint arXiv:2411.13543 (2024)

Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pig- natelli, E., Kuciński, Ł., Pinto, L., Fergus, R., et al.: Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543 (2024)

arXiv 2024

[30] [30]

Generative agents: Interactive simulacra of human behavior,

Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Gener- ative agents: Interactive simulacra of human behavior. UIST ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/ 3586183.3606763,https://doi.org/10.1145/3586183.3606763

work page doi:10.1145/3586183.3606763 2023

[31] [31]

Premack, D., Woodruff, G.: Does the chimpanzee have a theory of mind? Behav- ioral and brain sciences1(4), 515–526 (1978)

1978

[32] [32]

arXiv preprint arXiv:2408.15787 (2024)

Qiu,H.,Lan,Z.:Interactiveagents:Simulatingcounselor-clientpsychologicalcoun- seling via role-playing llm-to-llm interactions. arXiv preprint arXiv:2408.15787 (2024)

Pith/arXiv arXiv 2024

[33] [33]

Hogrefe & Huber Publishers (2002)

de Raad, B.E., Perugini, M.E.: Big five assessment. Hogrefe & Huber Publishers (2002)

2002

[34] [34]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

2021

[35] [35]

Physics Letters A 338(3), 217–224 (May 2005), ISSN 0375-9601, https://doi.org/10.1016/j

Ryumina, E., Dresvyanskiy, D., Karpov, A.: In search of a robust facial ex- pressions recognition model: A large-scale visual cross-corpus study. Neurocom- puting514, 435–450 (2022).https://doi.org/https://doi.org/10.1016/j. neucom.2022.10.013,https://www.sciencedirect.com/science/article/pii/ S0925231222012656

work page doi:10.1016/j 2022

[36] [36]

In: Bouamor, H., Pino, J., Bali, K

Shao, Y., Li, L., Dai, J., Qiu, X.: Character-LLM: A trainable agent for role- playing. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 13153– 13187. Association for Computational Linguistics, Singapore (Dec 2023),https: //aclanthology.org/2023.emnlp-main.814/

2023

[37] [37]

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S.R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., Perez, E.: Towards understanding sycophancy in language models (2025),https: //arxiv.org/abs/2310.13548

Pith/arXiv arXiv 2025

[38] [38]

Shuster, K., Urbanek, J., Szlam, A., Weston, J.: Am i me or you? state-of-the-art dialogue models cannot maintain an identity (2021),https://arxiv.org/abs/ 2112.05843

arXiv 2021

[39] [39]

wadsworth Belmont, CA (2009) Resonant Minds 19

Sternberg, R.J., Sternberg, K., Mio, J.: Cognitive psychology. wadsworth Belmont, CA (2009) Resonant Minds 19

2009

[40] [40]

Nature human behaviour8(7), 1285–1295 (2024)

Strachan, J.W., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., Saxena, K., Rufo, A., Panzeri, S., Manzi, G., et al.: Testing theory of mind in large language models and humans. Nature human behaviour8(7), 1285–1295 (2024)

2024

[41] [41]

In: European Conference on Computer Vision

Tan, S., Ji, B., Bi, M., Pan, Y.: Edtalk: Efficient disentanglement for emotional talking head synthesis. In: European Conference on Computer Vision. pp. 398–416. Springer (2025)

2025

[42] [42]

arXiv preprint arXiv:2504.18087 (2025)

Tan, W., Lin, C., Xu, C., Xu, F., Hu, X., Ji, X., Zhu, J., Wang, C., Fu, Y.: Dis- entangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation. arXiv preprint arXiv:2504.18087 (2025)

arXiv 2025

[43] [43]

arXiv preprint arXiv:2312.11805 (2023)

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

Pith/arXiv arXiv 2023

[44] [44]

Harvard university press (2009)

Tomasello, M.: The cultural origins of human cognition. Harvard university press (2009)

2009

[45] [45]

Belknap Press (2019)

Tomasello, M.: Becoming human: A theory of ontogeny. Belknap Press (2019)

2019

[46] [46]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xu, Z., Yu, Z., Zhou, Z., Zhou, J., Jin, X., Hong, F.T., Ji, X., Zhu, J., Cai, C., Tang, S., et al.: Hunyuanportrait: Implicit condition control for enhanced por- trait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15909–15919 (2025)

2025

[47] [47]

Yadav, N., Achananuparp, P., Jiang, J., Lim, E.P.: Dialtom: A theory of mind benchmark for forecasting state-driven dialogue trajectories (2026),https:// arxiv.org/abs/2604.20443

Pith/arXiv arXiv 2026

[48] [48]

arXiv preprint arXiv:2502.17810 (2025)

Yan, R., Li, X., Chen, W., Niu, Z., Yang, C., Ma, Z., Yu, K., Chen, X.: Uro- bench: A comprehensive benchmark for end-to-end spoken dialogue models. arXiv preprint arXiv:2502.17810 (2025)

arXiv 2025

[49] [49]

(eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personal- izing dialogue agents: I have a dog, do you have pets too? In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2204–2213. Asso- ciation for Computational Linguistics, Melbourne, ...

work page doi:10.18653/v1/p18-1205 2018

[50] [50]

Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation (2023),https://arxiv.org/abs/2211.12194

arXiv 2023

[51] [51]

Zhang, Z., Jin, C., Jia, M.Y., Zhang, S., Shu, T.: Autotom: Scaling model-based mental inference via automated agent modeling (2026),https://arxiv.org/abs/ 2502.15676

arXiv 2026

[52] [52]

Zhao, J., Yang, Q., Peng, Y., Bai, D., Yao, S., Sun, B., Chen, X., Fu, S., chen, W., Wei, X., Bo, L.: Humanomni: A large vision-speech language model for human- centric video understanding (2025),https://arxiv.org/abs/2501.15111

arXiv 2025

[53] [53]

In: European conference on computer vision

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Responsive listening head generation: a benchmark dataset and baseline. In: European conference on computer vision. pp. 124–142. Springer (2022)

2022

[54] [54]

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Vico-x: Multimodal conversation dataset.https://project.mhzhou.com/vico(2022), accessed: 2022- 09-30

2022

[55] [55]

Shangguan et al

Zhou, S., Zhou, Y., He, Y., Zhou, X., Wang, J., Deng, W., Shu, J.: Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech (2025),https://arxiv.org/abs/2506.21619 20 J. Shangguan et al

arXiv 2025

[56] [56]

In: Al- Onaizan, Y., Bansal, M., Chen, Y.N

Zhou, X., Su, Z., Eisape, T., Kim, H., Sap, M.: Is this the real life? is this just fantasy? the misleading success of simulating social interactions with LLMs. In: Al- Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 21692–21714. Association for Computational Linguistic...

work page doi:10.18653/v1/2024.emnlp- 2024

[57] [57]

ArXivabs/2310.11667(2023),https://api

Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., philippe Morency, L., Bisk, Y., Fried, D., Neubig, G., Sap, M.: Sotopia: Interactive evaluation for social intelligence in language agents. ArXivabs/2310.11667(2023),https://api. semanticscholar.org/CorpusID:264289186

arXiv 2023

[58] [58]

Showui: One vision-language- action model for GUI visual agent

Zhu, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: Infp: Audio-driven in- teractive head generation in dyadic conversations. pp. 10667–10677 (06 2025). https://doi.org/10.1109/CVPR52734.2025.00997 Resonant Minds 21 Resonant Minds: Closed-Loop Social A vatars with Theory of Mind Supplementary Material In this supplementary material, we present additi...

work page doi:10.1109/cvpr52734.2025.00997 2025

[59] [59]

Are you looking for something specific, or would you like me to show you some options?

Sally: Hi there! Welcome to the shop. Are you looking for something specific, or would you like me to show you some options?

[60] [60]

Mostly for writing papers, browsing, and some streaming

Jack: Hi! I'm looking for a reliable laptop for college. Mostly for writing papers, browsing, and some streaming. Do you have anything good for a student budget?

[61] [61]

I've fully serviced it myself

Sally: Absolutely! This Dell Inspiron has a fast processor, plenty of RAM, and excellent battery life. I've fully serviced it myself. I'm asking $675

[62] [62]

Is there any flexibility on the price?

Jack: I appreciate you showing me, but $675 is a bit higher than I was hoping. Is there any flexibility on the price?

[63] [63]

I could come down to $650 for you

Sally: I understand, college budgets can be tough. I could come down to $650 for you. That's a great deal considering the specs and warranty

[64] [64]

Would you be able to go any lower?

Jack: I appreciate that, but $650 is still a bit out of reach for me. Would you be able to go any lower?

[65] [65]

The best I can do on this Dell is $6 25

Sally: I understand where you're coming from. The best I can do on this Dell is $6 25. That's really competitive for what you're getting

[66] [66]

Do you have anything more budget-friendly?

Jack: I appreciate you working with me, but $625 is still a bit of a stretch. Do you have anything more budget-friendly?

[67] [67]

I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5

Sally: I understand. I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5. The ThinkPad is older but solid for college work

[68] [68]

The ThinkPad at $445 is closer to my range

Jack: I appreciate you showing me those. The ThinkPad at $445 is closer to my range. Would you be able to do any better on the price?

[69] [69]

I could do $430 for you on the ThinkPad

Sally: I understand. I could do $430 for you on the ThinkPad. I've recently replaced the battery, so it still holds a good charge

[70] [70]

Is there any chance you could go closer to $400? I'd take it today

Jack: I appreciate that, but $430 is still a bit of a stretch for me. Is there any chance you could go closer to $400? I'd take it today. Fig.D.1:Dialogue example inAgentmode (two independent LLM agents, no social reasoning). 34 J. Shangguan et al. Script Mode Two strangers meet at a second -hand electronics store. One is the store owner; the other is a c...

[71] [71]

Sally: Hi there, looking for anything in particular today?

[72] [72]

Do you have any good options?

Jack: Yeah, I need a laptop for college, but I'm on a bit of a budget. Do you have any good options?

[73] [73]

I'm asking $550 for it

Sally: Absolutely, I have this model right here — great condition, reliable battery. I'm asking $550 for it

[74] [74]

It looks good, but $550 is a bit high for me

Jack: I see. It looks good, but $550 is a bit high for me. My budget's really capped at $500

[75] [75]

How about $525? That's a great deal for what you're getting

Sally: I understand, but this laptop is one of our best. How about $525? That's a great deal for what you're getting

[76] [76]

If you can do $500, I'd be ready to buy it right now

Jack: I really can't go above $500. If you can do $500, I'd be ready to buy it right now

[77] [77]

I'll ring you up at the counter

Sally: Alright, $500 it is. I'll ring you up at the counter. Fig.D.2:Dialogue example inScriptmode (single omniscient LLM generates both sides). Resonant Minds 35 Our Method Two strangers meet at a second -hand electronics store. One is the store owner; the other is a college student looking to buy a laptop on a tight budget. Sally Chen She/her · 34 Secon...

[78] [78]

great options fo r students

Sally: Hey there! Looking for a laptop today? I've got some great options for students. Perception (Ja ck ← Sally) Spee ch: Friendly gree ting, "great options fo r students." Emotion: Cheerful, welcoming Expr: Open posture, warm smile, gestured into shop Perception (Ja ck ← Sally) B: Confident seller; will re commend expensive items first. D: Eager to mak...

[79] [79]

I nee d a laptop. What's unde r $4 00?

Jack: C1: "I nee d a laptop. What's unde r $4 00?"[E: 5 S :3 C:6 → 4.55] ▸ C2: "I nee d something for classes — writing, browsing, video calls. What's available?"[E: 8 S :8 C:9 → 8.25]✓ C3: "Hey, show me your che apest laptops."[E: 3 S :4 C:5 → 3.85] Perception (Sa lly ← Jack) Spee ch: "...What do you have in my... I mean, what's available?" Emotion: Slig...

[80] [80]

I have a T hinkPad at $46 0. Basic but works

Sally C1: "I have a T hinkPad at $46 0. Basic but works."[E: 6 S :4 C:7 → 5.55] C2: "Everything starts at $60 0+. Let me know."[E: 4 S :7 C:5 → 5.30] ▸ C3: "This Dell — solid specs, 3-month warranty. $58 0."[E: 8 S :9 C:8 → 8.35]✓ Perception (Ja ck ← Sally) Spee ch: "This Dell — $58 0, fully checked, 3-month warranty." Emotion: Confident, professional Exp...