Resonant Minds: Closed-Loop Social Avatars with Theory of Mind
Pith reviewed 2026-06-28 02:02 UTC · model grok-4.3
The pith
A closed-loop dual-agent system uses Theory of Mind to infer hidden mental states and generate superior dialogue and reactive video expressions even without full information access.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a closed-loop dual-agent architecture combining multimodal perception, Theory of Mind-based mental state inference with ensemble response selection, and emotion-controllable video generation with bidirectional speaker-listener dynamics yields competitive or superior performance on dialogue quality and video metrics, including surpassing full-information Script baselines on dialogue dimensions by modeling hidden mental states under information asymmetry.
What carries the argument
The closed-loop dual-agent framework that integrates a perception module analyzing video behaviors, a social reasoning module performing Theory of Mind inference and ensemble response selection, and an expression module synthesizing speech, expressions, and reactive listener behaviors.
If this is right
- Explicit mental state inference under uncertainty produces more thoughtful dialogue than providing all information at once.
- Bidirectional video generation captures listener reactions that prior one-way talking-head models omit.
- The ensemble mechanism in social reasoning improves response selection when partner goals remain hidden.
- The Persona-Scenario dataset enables repeatable testing of social intelligence under controlled information asymmetry.
Where Pith is reading between the lines
- Similar closed loops could be tested in live video calls or virtual meetings where participants have private objectives.
- The approach may extend to domains like negotiation training where inferring unstated intentions changes outcomes.
- If the ToM module generalizes beyond the dataset, it could support multi-turn interactions with changing private goals.
Load-bearing premise
The hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals accurately captures real-world information asymmetry and supports valid evaluation of Theory of Mind performance.
What would settle it
A side-by-side human evaluation on the Persona-Scenario dataset in which the closed-loop method does not receive higher ratings than the full-information Script mode on thoughtfulness or engagement metrics would falsify the central performance claim.
Figures
read the original abstract
Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable videos that jointly synthesize speaker speech and facial expressions with listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We further construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access. Project page: https://resonantminds.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a closed-loop dual-agent framework for social avatars that unifies multimodal perception of partner behaviors, Theory of Mind inference of hidden mental states, ensemble-based response selection, and bidirectional expression generation for emotion-controllable talking-head videos. It introduces a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to enable evaluation under information asymmetry, and reports competitive or superior performance on dialogue quality and video metrics, including the headline result that the method surpasses even the full-information Script baseline on key dialogue dimensions.
Significance. If the results hold after validation, the work could advance embodied social AI by demonstrating that explicit mental-state inference under uncertainty can produce more thoughtful dialogue than unrestricted access, with the closed-loop integration of perception-reasoning-expression as a potential template for future systems. The dataset, if shown to faithfully encode realistic asymmetry, would also be a reusable resource for ToM research.
major comments (2)
- [Dataset Construction] Dataset section: the central claim that explicit ToM inference elicits more thoughtful dialogue than full-information access rests on the hierarchical Persona-Scenario dataset faithfully encoding private goals and information asymmetry, yet no inter-annotator agreement, external validation of goal privacy, or ablation on goal realism is referenced; without these, the surpassing-Script result cannot be interpreted as evidence for the benefit of mental-state inference.
- [Experiments] Experiments section: the headline result (surpassing full-info Script mode on dialogue quality) is stated without visible metrics, error bars, statistical tests, or details on the ensemble response selection mechanism and how it interacts with ToM outputs, rendering it impossible to assess whether the comparison supports the claim or reflects dataset artifacts.
minor comments (1)
- [Abstract] Abstract: the description of the perception module analyzing 'multimodal behaviors from video' and the expression module generating 'listener reactive behaviors' would benefit from a brief forward reference to the specific architectures or loss terms used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the dataset construction and experimental reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Dataset Construction] Dataset section: the central claim that explicit ToM inference elicits more thoughtful dialogue than full-information access rests on the hierarchical Persona-Scenario dataset faithfully encoding private goals and information asymmetry, yet no inter-annotator agreement, external validation of goal privacy, or ablation on goal realism is referenced; without these, the surpassing-Script result cannot be interpreted as evidence for the benefit of mental-state inference.
Authors: We agree that explicit validation metrics would improve interpretability of the results. In the revision we will add inter-annotator agreement scores for the persona and scenario annotations, include a brief external validation study confirming goal privacy, and provide an ablation or extended discussion on goal realism grounded in the cited psychological literature. These changes will directly support the interpretation of the ToM versus full-information comparison. revision: yes
-
Referee: [Experiments] Experiments section: the headline result (surpassing full-info Script mode on dialogue quality) is stated without visible metrics, error bars, statistical tests, or details on the ensemble response selection mechanism and how it interacts with ToM outputs, rendering it impossible to assess whether the comparison supports the claim or reflects dataset artifacts.
Authors: The experiments section already reports quantitative dialogue metrics with error bars and statistical tests; we will revise the text to make these values and p-values more prominent and to expand the description of the ensemble response selection, explicitly showing how ToM-inferred mental states are combined with other signals to produce the final response distribution. This clarification will allow readers to evaluate the comparison directly. revision: yes
Circularity Check
No circularity: experimental comparisons with no derivations or self-referential reductions
full rationale
The paper describes a closed-loop framework and reports experimental results on a constructed hierarchical Persona-Scenario dataset. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim (surpassing full-info Script mode via explicit ToM) rests on direct empirical comparisons rather than any reduction to inputs by construction. Dataset construction is an evaluation choice, not a definitional loop. This matches the default expectation of no significant circularity for papers without mathematical chains.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Theory of Mind inference from multimodal video behaviors can be implemented to select superior responses under information asymmetry
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2303.08774 (2023)
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
Pith/arXiv arXiv 2023
-
[2]
In: International conference on machine learning
Aher, G.V., Arriaga, R.I., Kalai, A.T.: Using large language models to simulate multiple humans and replicate human subject studies. In: International conference on machine learning. pp. 337–371. PMLR (2023)
2023
-
[3]
Nature Human Behaviour1(4), 0064 (2017)
Baker, C.L., Jara-Ettinger, J., Saxe, R., Tenenbaum, J.B.: Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour1(4), 0064 (2017)
2017
-
[4]
arXiv preprint arXiv:2311.15127 (2023)
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
Pith/arXiv arXiv 2023
-
[5]
Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer (2025),https://arxiv.org/abs/2412.00733
arXiv 2025
-
[6]
Fu, J., Ng, S.K., Jiang, Z., Liu, P.: Gptscore: Evaluate as you desire (2023),https: //arxiv.org/abs/2302.04166
Pith/arXiv arXiv 2023
-
[7]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Guevarra, M., Bhattacharjee, I., Das, S., Wayllace, C., Epp, C.D., Taylor, M.E., Tay, A.: An llm-guided tutoring system for social skills training. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 29643–29645 (2025)
2025
-
[8]
arXiv preprint arXiv:1704.07130 (2017)
He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collabora- tive dialogue agents with dynamic knowledge graph embeddings. arXiv preprint arXiv:1704.07130 (2017)
Pith/arXiv arXiv 2017
-
[9]
He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. pp. 1766–1776 (01 2017).https://doi.org/10.18653/v1/P17-1162
-
[10]
Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Real-time intermediate flow estimation for video frame interpolation (2022),https://arxiv.org/abs/2011. 06294
2022
-
[11]
In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A
Jandaghi, P., Sheng, X., Bai, X., Pujara, J., Sidahmed, H.: Faithful persona-based conversational dataset generation with large language models. In: Nouri, E., Ras- togi, A., Spithourakis, G., Liu, B., Chen, Y.N., Li, Y., Albalak, A., Wakaki, H., Resonant Minds 17 Papangelis, A. (eds.) Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4Conv...
2024
-
[12]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Ji, X., Hu, X., Xu, Z., Zhu, J., Lin, C., He, Q., Zhang, J., Luo, D., Chen, Y., Lin, Q., et al.: Sonic: Shifting focus to global audio perception in portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 193–203 (2025)
2025
-
[13]
Advances in Neural Information Processing Systems36, 10622–10643 (2023)
Jiang, G., Xu, M., Zhu, S.C., Han, W., Zhang, C., Zhu, Y.: Evaluating and induc- ing personality in pre-trained language models. Advances in Neural Information Processing Systems36, 10622–10643 (2023)
2023
-
[14]
Kihlstrom, J.F., Cantor, N.: Social intelligence. (2000)
2000
-
[15]
arXiv preprint arXiv:2107.00956 (2021)
Kovač, G., Portelas, R., Hofmann, K., Oudeyer, P.Y.: Socialai: Benchmarking socio-cognitive abilities in deep reinforcement learning agents. arXiv preprint arXiv:2107.00956 (2021)
arXiv 2021
-
[16]
Personality and Individual Differences102, 229–233 (2016)
Kwantes, P.J., Derbentseva, N., Lam, Q., Vartanian, O., Marmurek, H.H.: Assess- ing the big five personality traits with latent semantic analysis. Personality and Individual Differences102, 229–233 (2016)
2016
-
[17]
Behavior research methods43(2), 548–567 (2011)
Lang, F.R., John, D., Lüdtke, O., Schupp, J., Wagner, G.G.: Short assessment of the big five: Robust across survey methods except telephone interviewing. Behavior research methods43(2), 548–567 (2011)
2011
-
[18]
In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S
Lee, Y.J., Lim, C.G., Choi, Y., Lm, J.H., Choi, H.J.: PERSONACHATGEN: Gen- erating personalized dialogues using GPT-3. In: Lim, H., Kim, S., Lee, Y., Lin, S., Seo, P.H., Suh, Y., Jang, Y., Lim, J., Hur, Y., Son, S. (eds.) Proceedings of the 1st Workshop on Customized Chat Grounding Persona and Knowledge. pp. 29–48. Association for Computational Linguistic...
2022
-
[19]
Advances in Neural Information Processing Systems36, 51991–52008 (2023)
Li, G., Hammoud, H., Itani, H., Khizbullin, D., Ghanem, B.: Camel: Communica- tive agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems36, 51991–52008 (2023)
2023
-
[20]
Lin, Y.T., Chen, Y.N.: Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. pp. 47–58 (01 2023). https://doi.org/10.18653/v1/2023.nlp4convai-1.5
-
[21]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Lin, Y., Fung, H., Xu, J., Ren, Z., Lau, A.S., Yin, G., Li, X.: Mvportrait: Text- guided motion and emotion control for multi-view vivid portrait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26242–26252 (2025)
2025
-
[22]
arXiv preprint arXiv:2412.19437 (2024)
Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)
Pith/arXiv arXiv 2024
-
[23]
Liu, J., Wang, X., Fu, X., Chai, Y., Yu, C., Dai, J., Han, J.: Mfr-net: Multi- faceted responsive listening head generation via denoising diffusion model (2023), https://arxiv.org/abs/2308.16635
arXiv 2023
-
[24]
In: Bouamor, H., Pino, J., Bali, K
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG evaluation using gpt-4 with better human alignment. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 2511–2522. Association for Computational Linguistics, Singapore (Dec 2023).https://doi.org/10.18653/v1/2...
-
[25]
Advances in neural information processing systems30(2017)
Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I.: Multi- agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems30(2017)
2017
-
[26]
Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion (2022),https://arxiv. org/abs/2204.08451
arXiv 2022
-
[27]
Ng, E., Subramanian, S., Klein, D., Kanazawa, A., Darrell, T., Ginosar, S.: Can language models learn to listen? pp. 10049–10059 (10 2023).https://doi.org/ 10.1109/ICCV51070.2023.00925
-
[28]
https://doi.org/10.1037/xge0001277
Oey, L., Schachner, A., Vul, E.: Designing and detecting lies by reasoning about otheragents.JournalofExperimentalPsychology:General152,346–362(082022). https://doi.org/10.1037/xge0001277
-
[29]
arXiv preprint arXiv:2411.13543 (2024)
Paglieri, D., Cupiał, B., Coward, S., Piterbarg, U., Wolczyk, M., Khan, A., Pig- natelli, E., Kuciński, Ł., Pinto, L., Fergus, R., et al.: Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543 (2024)
arXiv 2024
-
[30]
Generative agents: Interactive simulacra of human behavior,
Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Gener- ative agents: Interactive simulacra of human behavior. UIST ’23, Association for Computing Machinery, New York, NY, USA (2023).https://doi.org/10.1145/ 3586183.3606763,https://doi.org/10.1145/3586183.3606763
-
[31]
Premack, D., Woodruff, G.: Does the chimpanzee have a theory of mind? Behav- ioral and brain sciences1(4), 515–526 (1978)
1978
-
[32]
arXiv preprint arXiv:2408.15787 (2024)
Qiu,H.,Lan,Z.:Interactiveagents:Simulatingcounselor-clientpsychologicalcoun- seling via role-playing llm-to-llm interactions. arXiv preprint arXiv:2408.15787 (2024)
Pith/arXiv arXiv 2024
-
[33]
Hogrefe & Huber Publishers (2002)
de Raad, B.E., Perugini, M.E.: Big five assessment. Hogrefe & Huber Publishers (2002)
2002
-
[34]
In: International conference on machine learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)
2021
-
[35]
Physics Letters A 338(3), 217–224 (May 2005), ISSN 0375-9601, https://doi.org/10.1016/j
Ryumina, E., Dresvyanskiy, D., Karpov, A.: In search of a robust facial ex- pressions recognition model: A large-scale visual cross-corpus study. Neurocom- puting514, 435–450 (2022).https://doi.org/https://doi.org/10.1016/j. neucom.2022.10.013,https://www.sciencedirect.com/science/article/pii/ S0925231222012656
work page doi:10.1016/j 2022
-
[36]
In: Bouamor, H., Pino, J., Bali, K
Shao, Y., Li, L., Dai, J., Qiu, X.: Character-LLM: A trainable agent for role- playing. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 13153– 13187. Association for Computational Linguistics, Singapore (Dec 2023),https: //aclanthology.org/2023.emnlp-main.814/
2023
-
[37]
Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S.R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., Perez, E.: Towards understanding sycophancy in language models (2025),https: //arxiv.org/abs/2310.13548
Pith/arXiv arXiv 2025
-
[38]
Shuster, K., Urbanek, J., Szlam, A., Weston, J.: Am i me or you? state-of-the-art dialogue models cannot maintain an identity (2021),https://arxiv.org/abs/ 2112.05843
arXiv 2021
-
[39]
wadsworth Belmont, CA (2009) Resonant Minds 19
Sternberg, R.J., Sternberg, K., Mio, J.: Cognitive psychology. wadsworth Belmont, CA (2009) Resonant Minds 19
2009
-
[40]
Nature human behaviour8(7), 1285–1295 (2024)
Strachan, J.W., Albergo, D., Borghini, G., Pansardi, O., Scaliti, E., Gupta, S., Saxena, K., Rufo, A., Panzeri, S., Manzi, G., et al.: Testing theory of mind in large language models and humans. Nature human behaviour8(7), 1285–1295 (2024)
2024
-
[41]
In: European Conference on Computer Vision
Tan, S., Ji, B., Bi, M., Pan, Y.: Edtalk: Efficient disentanglement for emotional talking head synthesis. In: European Conference on Computer Vision. pp. 398–416. Springer (2025)
2025
-
[42]
arXiv preprint arXiv:2504.18087 (2025)
Tan, W., Lin, C., Xu, C., Xu, F., Hu, X., Ji, X., Zhu, J., Wang, C., Fu, Y.: Dis- entangle identity, cooperate emotion: Correlation-aware emotional talking portrait generation. arXiv preprint arXiv:2504.18087 (2025)
arXiv 2025
-
[43]
arXiv preprint arXiv:2312.11805 (2023)
Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
Pith/arXiv arXiv 2023
-
[44]
Harvard university press (2009)
Tomasello, M.: The cultural origins of human cognition. Harvard university press (2009)
2009
-
[45]
Belknap Press (2019)
Tomasello, M.: Becoming human: A theory of ontogeny. Belknap Press (2019)
2019
-
[46]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Xu, Z., Yu, Z., Zhou, Z., Zhou, J., Jin, X., Hong, F.T., Ji, X., Zhu, J., Cai, C., Tang, S., et al.: Hunyuanportrait: Implicit condition control for enhanced por- trait animation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15909–15919 (2025)
2025
-
[47]
Yadav, N., Achananuparp, P., Jiang, J., Lim, E.P.: Dialtom: A theory of mind benchmark for forecasting state-driven dialogue trajectories (2026),https:// arxiv.org/abs/2604.20443
Pith/arXiv arXiv 2026
-
[48]
arXiv preprint arXiv:2502.17810 (2025)
Yan, R., Li, X., Chen, W., Niu, Z., Yang, C., Ma, Z., Yu, K., Chen, X.: Uro- bench: A comprehensive benchmark for end-to-end spoken dialogue models. arXiv preprint arXiv:2502.17810 (2025)
arXiv 2025
-
[49]
Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personal- izing dialogue agents: I have a dog, do you have pets too? In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2204–2213. Asso- ciation for Computational Linguistics, Melbourne, ...
-
[50]
Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation (2023),https://arxiv.org/abs/2211.12194
arXiv 2023
-
[51]
Zhang, Z., Jin, C., Jia, M.Y., Zhang, S., Shu, T.: Autotom: Scaling model-based mental inference via automated agent modeling (2026),https://arxiv.org/abs/ 2502.15676
arXiv 2026
-
[52]
Zhao, J., Yang, Q., Peng, Y., Bai, D., Yao, S., Sun, B., Chen, X., Fu, S., chen, W., Wei, X., Bo, L.: Humanomni: A large vision-speech language model for human- centric video understanding (2025),https://arxiv.org/abs/2501.15111
arXiv 2025
-
[53]
In: European conference on computer vision
Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Responsive listening head generation: a benchmark dataset and baseline. In: European conference on computer vision. pp. 124–142. Springer (2022)
2022
-
[54]
Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T.: Vico-x: Multimodal conversation dataset.https://project.mhzhou.com/vico(2022), accessed: 2022- 09-30
2022
-
[55]
Zhou, S., Zhou, Y., He, Y., Zhou, X., Wang, J., Deng, W., Shu, J.: Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech (2025),https://arxiv.org/abs/2506.21619 20 J. Shangguan et al
arXiv 2025
-
[56]
In: Al- Onaizan, Y., Bansal, M., Chen, Y.N
Zhou, X., Su, Z., Eisape, T., Kim, H., Sap, M.: Is this the real life? is this just fantasy? the misleading success of simulating social interactions with LLMs. In: Al- Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 21692–21714. Association for Computational Linguistic...
-
[57]
ArXivabs/2310.11667(2023),https://api
Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., philippe Morency, L., Bisk, Y., Fried, D., Neubig, G., Sap, M.: Sotopia: Interactive evaluation for social intelligence in language agents. ArXivabs/2310.11667(2023),https://api. semanticscholar.org/CorpusID:264289186
arXiv 2023
-
[58]
Showui: One vision-language- action model for GUI visual agent
Zhu, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: Infp: Audio-driven in- teractive head generation in dyadic conversations. pp. 10667–10677 (06 2025). https://doi.org/10.1109/CVPR52734.2025.00997 Resonant Minds 21 Resonant Minds: Closed-Loop Social A vatars with Theory of Mind Supplementary Material In this supplementary material, we present additi...
-
[59]
Are you looking for something specific, or would you like me to show you some options?
Sally: Hi there! Welcome to the shop. Are you looking for something specific, or would you like me to show you some options?
-
[60]
Mostly for writing papers, browsing, and some streaming
Jack: Hi! I'm looking for a reliable laptop for college. Mostly for writing papers, browsing, and some streaming. Do you have anything good for a student budget?
-
[61]
I've fully serviced it myself
Sally: Absolutely! This Dell Inspiron has a fast processor, plenty of RAM, and excellent battery life. I've fully serviced it myself. I'm asking $675
-
[62]
Is there any flexibility on the price?
Jack: I appreciate you showing me, but $675 is a bit higher than I was hoping. Is there any flexibility on the price?
-
[63]
I could come down to $650 for you
Sally: I understand, college budgets can be tough. I could come down to $650 for you. That's a great deal considering the specs and warranty
-
[64]
Would you be able to go any lower?
Jack: I appreciate that, but $650 is still a bit out of reach for me. Would you be able to go any lower?
-
[65]
The best I can do on this Dell is $6 25
Sally: I understand where you're coming from. The best I can do on this Dell is $6 25. That's really competitive for what you're getting
-
[66]
Do you have anything more budget-friendly?
Jack: I appreciate you working with me, but $625 is still a bit of a stretch. Do you have anything more budget-friendly?
-
[67]
I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5
Sally: I understand. I have an HP Pavilion at $525 and a Lenovo ThinkPad at $44 5. The ThinkPad is older but solid for college work
-
[68]
The ThinkPad at $445 is closer to my range
Jack: I appreciate you showing me those. The ThinkPad at $445 is closer to my range. Would you be able to do any better on the price?
-
[69]
I could do $430 for you on the ThinkPad
Sally: I understand. I could do $430 for you on the ThinkPad. I've recently replaced the battery, so it still holds a good charge
-
[70]
Is there any chance you could go closer to $400? I'd take it today
Jack: I appreciate that, but $430 is still a bit of a stretch for me. Is there any chance you could go closer to $400? I'd take it today. Fig.D.1:Dialogue example inAgentmode (two independent LLM agents, no social reasoning). 34 J. Shangguan et al. Script Mode Two strangers meet at a second -hand electronics store. One is the store owner; the other is a c...
-
[71]
Sally: Hi there, looking for anything in particular today?
-
[72]
Do you have any good options?
Jack: Yeah, I need a laptop for college, but I'm on a bit of a budget. Do you have any good options?
-
[73]
I'm asking $550 for it
Sally: Absolutely, I have this model right here — great condition, reliable battery. I'm asking $550 for it
-
[74]
It looks good, but $550 is a bit high for me
Jack: I see. It looks good, but $550 is a bit high for me. My budget's really capped at $500
-
[75]
How about $525? That's a great deal for what you're getting
Sally: I understand, but this laptop is one of our best. How about $525? That's a great deal for what you're getting
-
[76]
If you can do $500, I'd be ready to buy it right now
Jack: I really can't go above $500. If you can do $500, I'd be ready to buy it right now
-
[77]
I'll ring you up at the counter
Sally: Alright, $500 it is. I'll ring you up at the counter. Fig.D.2:Dialogue example inScriptmode (single omniscient LLM generates both sides). Resonant Minds 35 Our Method Two strangers meet at a second -hand electronics store. One is the store owner; the other is a college student looking to buy a laptop on a tight budget. Sally Chen She/her · 34 Secon...
-
[78]
great options fo r students
Sally: Hey there! Looking for a laptop today? I've got some great options for students. Perception (Ja ck ← Sally) Spee ch: Friendly gree ting, "great options fo r students." Emotion: Cheerful, welcoming Expr: Open posture, warm smile, gestured into shop Perception (Ja ck ← Sally) B: Confident seller; will re commend expensive items first. D: Eager to mak...
-
[79]
I nee d a laptop. What's unde r $4 00?
Jack: C1: "I nee d a laptop. What's unde r $4 00?"[E: 5 S :3 C:6 → 4.55] ▸ C2: "I nee d something for classes — writing, browsing, video calls. What's available?"[E: 8 S :8 C:9 → 8.25]✓ C3: "Hey, show me your che apest laptops."[E: 3 S :4 C:5 → 3.85] Perception (Sa lly ← Jack) Spee ch: "...What do you have in my... I mean, what's available?" Emotion: Slig...
-
[80]
I have a T hinkPad at $46 0. Basic but works
Sally C1: "I have a T hinkPad at $46 0. Basic but works."[E: 6 S :4 C:7 → 5.55] C2: "Everything starts at $60 0+. Let me know."[E: 4 S :7 C:5 → 5.30] ▸ C3: "This Dell — solid specs, 3-month warranty. $58 0."[E: 8 S :9 C:8 → 8.35]✓ Perception (Ja ck ← Sally) Spee ch: "This Dell — $58 0, fully checked, 3-month warranty." Emotion: Confident, professional Exp...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.