
arxiv: 2504.13898 · v3 · submitted 2025-04-07 · 💻 cs.HC · cs.AI

Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning

Pith reviewed 2026-05-22 21:11 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords: social reasoning, human-robot interaction, foundation models, benchmark dataset, social errors, embodied conversation, SHREC, conversational mechanics

The pith

Foundation models exhibit substantial performance gaps in recognizing social deficits during human-robot interactions on the SHREC benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the SHREC Dataset to evaluate foundation models' social reasoning in real human-robot conversations rather than human-human ones. It assembles roughly 400 videos with more than 10,000 annotations that mark social errors, competencies, rationales, and corrections. Eight tasks are defined to test detection of errors, identification of social attributes, understanding of interaction flow, and generation of corrective actions. Experiments with current models produce clear shortfalls relative to human performance, indicating that social reasoning remains difficult for embodied AI. The dataset is positioned as a resource to direct future improvements in socially capable robots.

Core claim

The SHREC Dataset is a benchmark of approximately 400 real-world human-robot interaction videos and over 10K annotations that capture robot social errors, competencies, underlying rationales, and corrections. It defines eight benchmark tasks targeting detection of social errors and competencies, identification of underlying social attributes, comprehension of interaction flow, and provision of rationale and alternative correct actions. Experiments with state-of-the-art foundation models reveal substantial performance gaps relative to human evaluators.

What carries the argument

The SHREC Dataset together with its eight benchmark tasks that measure social reasoning capabilities in human-robot interactions.

If this is right

  • Foundation models require improvements in emotion understanding, intention tracking, and conversational mechanics for robot contexts.
  • The dataset highlights social challenges unique to human-robot interactions that prior human-human datasets do not address.
  • Directions emerge for developing socially intelligent AI by targeting the identified failure modes.
  • The eight tasks provide concrete evaluation criteria for tracking progress in social reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robots built with models trained to close these gaps could produce more natural daily interactions.
  • Similar video-based benchmarks could be applied to other embodied settings such as assistive devices or autonomous systems.
  • The performance gaps suggest that scaling alone may not suffice without explicit social-situation training data.

Load-bearing premise

The annotations and task definitions in the SHREC dataset accurately capture the social attributes, rationales, and corrections needed for real-world human-robot social reasoning.

What would settle it

A state-of-the-art foundation model achieving performance levels comparable to human evaluators across all eight tasks on the SHREC videos would falsify the reported substantial performance gaps.
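
The criterion above can be sketched as a simple per-task check. This is only an illustration under stated assumptions: the task keys (`task_1` to `task_8`), the `tolerance` that defines "comparable", and every score below are hypothetical placeholders, not values or names taken from the paper.

```python
from typing import Dict


def per_task_gap(model: Dict[str, float], human: Dict[str, float]) -> Dict[str, float]:
    """Human score minus model score per task (positive means the model trails)."""
    if set(model) != set(human):
        raise ValueError("model and human scores must cover the same tasks")
    return {task: human[task] - model[task] for task in human}


def gaps_closed(model: Dict[str, float], human: Dict[str, float],
                tolerance: float = 0.02) -> bool:
    """True only if the model is within `tolerance` of humans on every task.

    `tolerance` is an assumed threshold for "comparable"; the source does not define one.
    """
    return all(gap <= tolerance for gap in per_task_gap(model, human).values())


# Placeholder scores for illustration only; these are NOT results from the paper.
human_scores = {f"task_{i}": 0.90 for i in range(1, 9)}
model_scores = {f"task_{i}": 0.89 for i in range(1, 9)}
model_scores["task_3"] = 0.60  # one lagging task is enough to keep the gap open

print(gaps_closed(model_scores, human_scores))
```

Requiring the check to hold on all eight tasks, rather than on an average, mirrors the "across all eight tasks" wording: a strong mean score cannot hide a single task where the model still trails.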

Figures

Figures reproduced from arXiv: 2504.13898 by Cynthia Breazeal, Denison Guvenoz, Dong Won Lee, Hae Won Park, Louis-Philippe Morency, Parker Malachowsky, Sooyeon Jeong, Yubin Kim.

Figure 1: SHREC Dataset offers real-world Social Human Robot Embodied Conversation videos and annotations of errors and competencies, the channel and type of social attribute, along with rationale and possible corrective actions. (Top) Error sourced from verbal (audio) channel, (Bottom) Error sourced from non-verbal (visual) channel. to the best of our knowledge, one of the largest real-world human–social ro…

Figure 2: SHREC Dataset contains high overlapping annotations with a high level of agreement. The dataset includes error and competency labels, and annotations for the source of evidence either from nonverbal cues, verbal cues, and explanatory factors in the form of seven key social attributes. participation and involvement in social interactions, including cues that indicate interest or disinterest, e.g., continuin…

Figure 3: Our benchmark offers eight tasks dedicated to probing four core facets of AI model's social reasoning: (1) detecting social errors and competencies, …

Figure 4: Results per model across all 8 tasks. Human performance is marked in dashed lines. (L): language-only inputs, (L+V): language and visual inputs.

Figure 5: Our dataset offers annotations identifying errors, competencies with …

Figure 6: Error Per Attribute

Figure 8: Attribute Identification F1 Per Attribute

Figure 9: Annotation Procedure: Annotators watch the video, select moment of …

Figure 10: Participant Consented to share images for publications. A screenshot of our annotation tool. Our tool enables the viewing of the video interaction, …

Figure 11: Participant Consented to share images for publications. A screenshot of our internal annotation tool in the edit phase. Our tool flexibly allows the …

Figure 12: Original face (left) transformed into a fully synthetic version (right), preserving key social while ensuring privacy for responsible large-scale data …

Figure 13: Wellness […]

Figure 14: Empathic [48] Dataset Statistics: We find that 73.1% of the dataset consists of overlapping annotations, where two annotators marked the sample. We refer the reader to Appendix REFER for the algorithm used to calculate overlaps. Amongst the overlapping samples as shown in Figure B, we find a 92.5% overall agreement, where annotators agree on the error/competency and social/competency labels. A random ag…
Original abstract

Our work focuses on the social reasoning capabilities of foundation models for real-world human-robot interactions. We introduce the Social Human Robot Embodied Conversation (SHREC) Dataset, a benchmark of ~400 real-world human-robot interaction videos and over 10K annotations, capturing robot social errors, competencies, underlying rationales, and corrections. Unlike prior datasets focused on human-human interactions, the SHREC Dataset uniquely highlights the social challenges faced by real-world social robots such as emotion understanding, intention tracking, and conversational mechanics. Moreover, current foundation models struggle to recognize these deficits, which manifest as subtle, socially situated failures. To evaluate AI models' capacity for social reasoning, we define eight benchmark tasks targeting critical areas such as (1) detection of social errors and competencies, (2) identification of underlying social attributes, (3) comprehension of interaction flow, and (4) providing rationale and alternative correct actions. Experiments with state-of-the-art foundation models, alongside human evaluations, reveal substantial performance gaps, underscoring the difficulty and providing directions in developing socially intelligent AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the SHREC Dataset of ~400 real-world human-robot interaction videos and over 10K annotations capturing robot social errors, competencies, rationales, and corrections. It defines eight benchmark tasks targeting detection of social errors/competencies, identification of social attributes, comprehension of interaction flow, and provision of rationales/corrections. Experiments with state-of-the-art foundation models and human evaluations report substantial performance gaps, underscoring difficulties in social reasoning for embodied HRI.

Significance. If the annotations are shown to be reliable and the tasks validly isolate real social attributes in HRI, the work would supply a needed benchmark distinct from human-human datasets, offering concrete directions for improving foundation models on subtle, situated social failures such as emotion understanding and intention tracking.

major comments (1)
  1. §3 (Dataset Construction) and §4 (Benchmark Tasks): The manuscript supplies no quantitative details on the annotation protocol (number of annotators per item, training, adjudication) or inter-rater reliability metrics (Cohen/Fleiss kappa or equivalent) for the ~10K annotations. This directly undermines the central claim of model limitations, because observed gaps on the eight tasks could arise from noisy or subjective labels rather than genuine shortfalls in social reasoning.
minor comments (2)
  1. Abstract: The phrase 'substantial performance gaps' is stated without accompanying numerical results (e.g., accuracy or F1 differences between models and humans); adding these would strengthen the summary.
  2. §5 (Experiments): Clarify the exact prompting format and output parsing procedure used for each of the eight tasks to improve reproducibility.
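
To make the reproducibility concern in the second minor comment concrete, the sketch below shows the kind of output-parsing step a benchmark report would need to document. It is a hypothetical illustration, not the authors' procedure: the JSON field name `label`, the two-value label set, and the fallback rule are all assumptions introduced here.

```python
import json
import re
from typing import Optional

# Hypothetical label set for an error/competency detection task.
VALID_LABELS = {"error", "competency"}


def parse_label(raw: str) -> Optional[str]:
    """Extract a task label from a raw model response; None means unparseable."""
    # First attempt: a JSON object with a "label" field somewhere in the response.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            value = json.loads(match.group(0)).get("label")
            if isinstance(value, str) and value.strip().lower() in VALID_LABELS:
                return value.strip().lower()
        except json.JSONDecodeError:
            pass  # fall through to the keyword fallback
    # Fallback: accept free text only if exactly one valid label appears as a word.
    text = raw.lower()
    found = {label for label in VALID_LABELS if re.search(rf"\b{label}\b", text)}
    return found.pop() if len(found) == 1 else None


print(parse_label('Sure! {"label": "Competency", "why": "kept eye contact"}'))
print(parse_label("error or competency, hard to say"))
```

How unparseable responses are scored (counted as wrong, retried, or dropped) can shift reported accuracy, which is why the exact rule matters for comparing models.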

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for noting the potential significance of the SHREC dataset. We address the major comment on annotation protocol and reliability below.

Point-by-point responses
  1. Referee: §3 (Dataset Construction) and §4 (Benchmark Tasks): The manuscript supplies no quantitative details on the annotation protocol (number of annotators per item, training, adjudication) or inter-rater reliability metrics (Cohen/Fleiss kappa or equivalent) for the ~10K annotations. This directly undermines the central claim of model limitations, because observed gaps on the eight tasks could arise from noisy or subjective labels rather than genuine shortfalls in social reasoning.

    Authors: We agree that quantitative details on the annotation protocol and inter-rater reliability metrics are essential to establish label quality and support the validity of the benchmark tasks. The initial manuscript omitted these specifics. In the revised version, we will expand §3 to report the number of annotators per item, annotator training and adjudication procedures, and inter-rater reliability metrics (e.g., Fleiss' kappa) for the annotations. These additions will allow assessment of whether the observed model performance gaps reflect genuine social reasoning challenges.

    revision: yes
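
For readers unfamiliar with the metric named in this exchange, here is a minimal sketch of Fleiss' kappa using the standard formula. The input layout (one row per annotated item, one column per label, each cell counting the raters who chose that label) and the toy counts are assumptions for illustration, not data from the SHREC annotations.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = number of raters giving item i label j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of raters")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall label proportions.
    n_labels = len(counts[0])
    p_labels = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_labels)
    ]
    p_expected = sum(p * p for p in p_labels)

    if p_expected == 1.0:
        return math.nan  # only one label was ever used; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: 4 items, 2 raters, labels = (error, competency).
toy_counts = [[2, 0], [2, 0], [0, 2], [1, 1]]
print(round(fleiss_kappa(toy_counts), 3))
```

In the toy example the raters agree on three of four items (raw agreement 0.75), yet kappa is noticeably lower because part of that agreement is expected by chance. That difference is why the referee asks for a chance-corrected statistic rather than percent agreement alone.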

Circularity Check

0 steps flagged

Empirical dataset and benchmark paper with no derivations or self-referential predictions

Full rationale

The paper introduces the SHREC dataset of ~400 HRI videos and >10K annotations, defines eight benchmark tasks, and reports model performance gaps. No equations, fitted parameters, or derivation chains appear in the provided text. The central claims rest on empirical annotation and evaluation rather than any self-definition, fitted-input-as-prediction, or self-citation load-bearing step. External model evaluations and human comparisons serve as independent benchmarks. This is the normal case of a self-contained empirical contribution; no circularity is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified assumption that the new annotations and task definitions constitute a valid measure of social reasoning; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: The eight benchmark tasks accurately target and measure critical areas of social reasoning in human-robot interactions.
    Abstract defines the tasks but provides no justification or validation for why they capture the intended constructs.



Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 11 internal anchors

  1. [1]

    Talking turns: Benchmarking audio foundation models on turn-taking dynamics.arXiv preprint arXiv:2503.01174, 2025

    Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, and Shinji Watanabe. Talking turns: Benchmarking audio foundation models on turn-taking dynamics.arXiv preprint arXiv:2503.01174, 2025

  2. [2]

    Inter-coder agreement for computational linguistics.Computational linguistics, 34(4):555–596, 2008

    Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics.Computational linguistics, 34(4):555–596, 2008

  3. [3]

    A new test of social sensitivity: Detection of faux pas in normal children and children with asperger syndrome.Journal of Autism and Developmental Disorders, 29(5):407–418, 1999

    Simon Baron-Cohen, Michelle O’Riordan, Rosie Jones, Valerie Stone, and Kate Plaisted. A new test of social sensitivity: Detection of faux pas in normal children and children with asperger syndrome.Journal of Autism and Developmental Disorders, 29(5):407–418, 1999

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  5. [5]

    Sree Bhattacharyya and James Z. Wang. Evaluating vision-language models for emotion recognition. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 1798–1820, Albuquerque, New Mexico, April

  6. [6]

    ISBN 979-8-89176-195-7

    Association for Computational Linguistics. ISBN 979-8-89176-195-7. URL https://aclanthology.org/2025. findings-naacl.97/

  7. [7]

    Internvl: Scaling up vision foun- dation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foun- dation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185– 24198, 2024

  8. [8]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024

  9. [9]

    Pydantic, 1

    Samuel Colvin, Eric Jolibois, Hasan Ramezani, Adrian Garcia Badaracco, Terrence Dorsey, David Montague, Serge Matveenko, Marcelo Trylesinski, Sydney Runkle, David Hewitt, Alex Hall, and Victorien Plot. Pydantic, 1

  10. [10]

    URL https://docs.pydantic.dev/latest/

  11. [11]

    Aya vision: Advancing the frontier of multilingual multimodality.arXiv preprint arXiv:2505.08751, 2025

    Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, et al. Aya vision: Advancing the frontier of multilingual multimodality.arXiv preprint arXiv:2505.08751, 2025

  12. [12]

    Commonsense reasoning and commonsense knowledge in artificial intelligence

    Ernest Davis and Gary Marcus. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9):92–103, 2015

  13. [13]

    Interpersonal reactivity index.Journal of Personality and Social Psychology, 1980

    Mark H Davis. Interpersonal reactivity index.Journal of Personality and Social Psychology, 1980

  14. [14]

    Socratis: Are large multimodal models emotionally aware?arXiv preprint arXiv:2308.16741, 2023

    Katherine Deng, Arijit Ray, Reuben Tan, Saadia Gabriel, Bryan A Plummer, and Kate Saenko. Socratis: Are large multimodal models emotionally aware?arXiv preprint arXiv:2308.16741, 2023

  15. [15]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  16. [16]

    Introducing masc: a movie for the assessment of social cognition.Journal of autism and developmental disorders, 36:623–636, 2006

    Isabel Dziobek, Stefan Fleck, Elke Kalbe, Kimberley Rogers, Jason Hassenstab, Matthias Brand, Josef Kessler, Jan K Woike, Oliver T Wolf, and Antonio Convit. Introducing masc: a movie for the assessment of social cognition.Journal of autism and developmental disorders, 36:623–636, 2006

  17. [17]

    Repairing trust in robots?: A meta-analysis of hri trust repair studies with a no-repair condition

    Connor Esterwood and Lionel P Robert. Repairing trust in robots?: A meta-analysis of hri trust repair studies with a no-repair condition. In2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 410–419. IEEE, 2025

  18. [18]

    The artificial- social-agent questionnaire: establishing the long and short questionnaire versions

    Siska Fitrianie, Merijn Bruijnes, Fengxiang Li, Amal Abdulrahman, and Willem-Paul Brinkman. The artificial- social-agent questionnaire: establishing the long and short questionnaire versions. InProceedings of the 22nd ACM International Conference on Intelligent Virtual Agents, pages 1–8, 2022

  19. [19]

    Investigating con- versational dynamics: Interactive alignment, interpersonal synergy, and collective task performance.Cognitive science, 40(1):145–171, 2016

    Riccardo Fusaroli and Kristian Tylén. Investigating con- versational dynamics: Interactive alignment, interpersonal synergy, and collective task performance.Cognitive science, 40(1):145–171, 2016

  20. [20]

    Systematic analysis of video data from different human– robot interaction studies: a categorization of social signals during error situations.Frontiers in psychology, 6:931, 2015

    Manuel Giuliani, Nicole Mirnig, Gerald Stollnberger, Susanne Stadler, Roland Buchner, and Manfred Tscheligi. Systematic analysis of video data from different human– robot interaction studies: a categorization of social signals during error situations.Frontiers in psychology, 6:931, 2015

  21. [21]

    reading the mind in films

    Ofer Golan, Simon Baron-Cohen, Jacqueline J Hill, and Yael Golan. The “reading the mind in films” task: complex emotion recognition in adults with and without autism spectrum conditions.Social Neuroscience,, 1(2):111–123, 2006

  22. [22]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  23. [23]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  24. [24]

    Affective social competence.Social development, 10(1):79–119, 2001

    Amy G Halberstadt, Susanne A Denham, and Julie C Dun- smore. Affective social competence.Social development, 10(1):79–119, 2001

  25. [25]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  26. [26]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  27. [27]

    Explanations from a robotic partner build trust on the robot’s decisions for collaborative human-humanoid interaction.Robotics, 10(1):51, 2021

    Misbah Javaid and Vladimir Estivill-Castro. Explanations from a robotic partner build trust on the robot’s decisions for collaborative human-humanoid interaction.Robotics, 10(1):51, 2021

  28. [28]

    A robotic positive psychology coach to improve college students’ wellbeing

    Sooyeon Jeong, Sharifa Alghowinem, Laura Aymerich- Franch, Kika Arias, Agata Lapedriza, Rosalind Picard, Hae Won Park, and Cynthia Breazeal. A robotic positive psychology coach to improve college students’ wellbeing. In2020 29th IEEE international conference on robot and human interactive communication (RO-MAN), pages 187–194. IEEE, 2020

  29. [29]

    A robotic companion for psychological well-being: A long-term investigation of companionship and therapeutic alliance

    Sooyeon Jeong, Laura Aymerich-Franch, Sharifa Al- ghowinem, Rosalind W Picard, Cynthia L Breazeal, and Hae Won Park. A robotic companion for psychological well-being: A long-term investigation of companionship and therapeutic alliance. InProceedings of the 2023 ACM/IEEE international conference on human-robot interaction, pages 485–494, 2023

  30. [30]

    Deploying a robotic positive psychology coach to improve college students’ psychological well-being.User Modeling and User-Adapted Interaction, 33(2):571–615, 2023

    Sooyeon Jeong, Laura Aymerich-Franch, Kika Arias, Sharifa Alghowinem, Agata Lapedriza, Rosalind Picard, Hae Won Park, and Cynthia Breazeal. Deploying a robotic positive psychology coach to improve college students’ psychological well-being.User Modeling and User-Adapted Interaction, 33(2):571–615, 2023

  31. [31]

    Trust repair in human-agent teams: the effectiveness of explanations and expressing regret

    Esther S Kox, José H Kerstholt, Tom F Hueting, and Peter W de Vries. Trust repair in human-agent teams: the effectiveness of explanations and expressing regret. Autonomous agents and multi-agent systems, 35(2):30, 2021

  32. [32]

    Building machines that learn and think like people.Behavioral and brain sciences, 40: e253, 2017

    Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people.Behavioral and brain sciences, 40: e253, 2017

  33. [33]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942, 2019

  34. [34]

    RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback.arXiv preprint arXiv:2309.00267, 2023

  35. [35]

    Explanation-based finetuning makes models more robust to spurious cues.arXiv preprint arXiv:2305.04990, 2023

    Josh Magnus Ludan, Yixuan Meng, Tai Nguyen, Saurabh Shah, Qing Lyu, Marianna Apidianaki, and Chris Callison- Burch. Explanation-based finetuning makes models more robust to spurious cues.arXiv preprint arXiv:2305.04990, 2023

  36. [36]

    Advancing social intelligence in ai agents: Technical challenges and open questions.arXiv preprint arXiv:2404.11023, 2024

    Leena Mathur, Paul Pu Liang, and Louis-Philippe Morency. Advancing social intelligence in ai agents: Technical challenges and open questions.arXiv preprint arXiv:2404.11023, 2024

  37. [37]

    Social genome: Grounded social reasoning abilities of multimodal models.arXiv preprint arXiv:2502.15109, 2025

    Leena Mathur, Marian Qian, Paul Pu Liang, and Louis- Philippe Morency. Social genome: Grounded social reasoning abilities of multimodal models.arXiv preprint arXiv:2502.15109, 2025

  38. [38]

    Mixed-method long-term robot usage: Older adults’ lived experience of social robots

    Anastasia K Ostrowski, Cynthia Breazeal, and Hae Won Park. Mixed-method long-term robot usage: Older adults’ lived experience of social robots. In2022 17th ACM/IEEE international conference on human-robot interaction (HRI), pages 33–42. IEEE, 2022

  39. [39]

    Train- ing language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Train- ing language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  40. [40]

    Growing growth mindset with a social robot peer

    Hae Won Park, Rinat Rosenberg-Kima, Maor Rosenberg, Goren Gordon, and Cynthia Breazeal. Growing growth mindset with a social robot peer. InProceedings of the 2017 ACM/IEEE international conference on human-robot interaction, pages 137–145, 2017

  41. [41]

    Jibo community social robot research platform@ scale

    Hae Won Park, Cynthia Breazeal, Sharifa Alghowinem, Anastasia K Ostrowski, Jon Ferguson, Xiajie Zhang, and Dong Won Lee. Jibo community social robot research platform@ scale. InCompanion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, pages 1346–1348, 2024

  42. [42]

    Intro- ducing gemini 2.0: our new ai model for the agentic era, 2024

    Sundar Pichai, D Hassabis, and K Kavukcuoglu. Intro- ducing gemini 2.0: our new ai model for the agentic era, 2024

  43. [43]

    MELD: A multimodal multi-party dataset for emotion recognition in conversations

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, Florence, Ital...

  44. [44]

    Does self-rationalization improve robustness to spurious correlations?arXiv preprint arXiv:2210.13575, 2022

    Alexis Ross, Matthew E Peters, and Ana Marasovi ´c. Does self-rationalization improve robustness to spurious correlations?arXiv preprint arXiv:2210.13575, 2022

  45. [45]

    Atomic: An atlas of machine commonsense for if-then reasoning

    Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. Atomic: An atlas of machine commonsense for if-then reasoning. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 3027–3035, 2019

  46. [46]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions.arXiv preprint arXiv:1904.09728, 2019

  47. [47]

    Social ontology and the philosophy of society.Analyse & Kritik, 20(2):143–158, 1998

    John R Searle. Social ontology and the philosophy of society.Analyse & Kritik, 20(2):143–158, 1998

  48. [48]

    How well do large language models perform on faux pas tests? InFindings of the Association for Computational Linguistics: ACL 2023, pages 10438–10451, 2023

    Natalie Shapira, Guy Zwirn, and Yoav Goldberg. How well do large language models perform on faux pas tests? InFindings of the Association for Computational Linguistics: ACL 2023, pages 10438–10451, 2023

  49. [49]

    Memor: A dataset for multimodal emotion reasoning in videos

    Guangyao Shen et al. Memor: A dataset for multimodal emotion reasoning in videos. InProceedings of the 28th ACM International Conference on Multimedia, pages 4937–4945, 2020

  50. [50]

    Empathicstories++: A multimodal dataset for empathy towards personal experiences.arXiv preprint arXiv:2405.15708, 2024

    Jocelyn Shen, Yubin Kim, Mohit Hulse, Wazeer Zulfikar, Sharifa Alghowinem, Cynthia Breazeal, and Hae Won Park. Empathicstories++: A multimodal dataset for empathy towards personal experiences.arXiv preprint arXiv:2405.15708, 2024

  51. [51]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  52. [52]

    Emotion norms, emotion work, and social order

    Peggy A Thoits. Emotion norms, emotion work, and social order. InFeelings and emotions: The Amsterdam symposium, pages 359–378. Cambridge University Press Cambridge, UK, 2004

  53. [53]

    A taxonomy of social errors in human-robot interaction.ACM Transactions on Human-Robot Interaction (THRI), 10(2):1–32, 2021

    Leimin Tian and Sharon Oviatt. A taxonomy of social errors in human-robot interaction.ACM Transactions on Human-Robot Interaction (THRI), 10(2):1–32, 2021

  54. [54]

    Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks, 2023

    Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks.arXiv preprint arXiv:2302.08399, 2023

  55. [55]

    Emotional intelligence of large language models.Journal of Pacific Rim Psychology, 17:18344909231213958, 2023

    Xuena Wang, Xueting Li, Zi Yin, Yue Wu, and Jia Liu. Emotional intelligence of large language models.Journal of Pacific Rim Psychology, 17:18344909231213958, 2023

  56. [56]

    Social-iq 2.0 challenge: Benchmarking mul- timodal social understanding.Social-iq 2.0 challenge: Benchmarking multimodal social understanding, 2023

    Alex Wilf, Leena Mathur, Sheryl Mathew, Claire Ko, Youssouf Kebe, Paul Pu Liang, and Louis-Philippe Morency. Social-iq 2.0 challenge: Benchmarking mul- timodal social understanding.Social-iq 2.0 challenge: Benchmarking multimodal social understanding, 2023

  57. [57]

    Coke: A cognitive knowledge graph for machine theory of mind.arXiv preprint arXiv:2305.05390, 2023

    Jincenzi Wu, Zhuang Chen, Jiawen Deng, Sahand Sabour, and Minlie Huang. Coke: A cognitive knowledge graph for machine theory of mind.arXiv preprint arXiv:2305.05390, 2023

  58. [58]

    Fine-grained human feedback gives better rewards for language model training

    Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36:59008–59033, 2023

  59. [59]

    Fresco: Spatial-temporal correspondence for zero-shot video translation

    Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Fresco: Spatial-temporal correspondence for zero-shot video translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8703–8712, 2024

  60. [60]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

  61. [61]

    Social-iq: A question answering benchmark for artificial social intelligence

    Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-iq: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8807–8817, 2019

  62. [62]

    Social-iq: A question answering benchmark for artificial social intelligence

    Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-iq: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8807–8817, 2019

  63. [63]

    Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph

    AmirAli Bagher Zadeh, Paul Pu Liang, Sahisnu Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, 2018

  64. [64]

    Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph

    AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, 2018

  65. [65]

    Investigating the catastrophic forgetting in multimodal large language models

    Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313, 2023

  66. [66]

    Llava-next: A strong zero-shot video understanding model, April 2024

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024. URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/

  67. [67]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  68. [68]

    Sotopia: Interactive evaluation for social intelligence in language agents, 2024

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for social intelligence in language agents, 2024. URL https://arxiv.org/abs/2310.11667.

    APPENDIX

    We employed two independent annotators for every video segment, cove...

  69. [69]

    • Social Error: Behaviors that violate social norms and degrade a user’s perception of the robot’s socio-affective competence, such as interrupting at inappropriate times

    None Definitions: • Social Competence: The ability to successfully conduct social interactions, which depends on the awareness and identification of social-emotional cues, the ability to process such cues, and the ability to decide on and express a normative response. • Social Error: Behaviors that violate social norms and degrade a user’s perception of the...

  70. [70]

    • None: Neither a social error nor competence is observed

    None Definitions: • Social Error: Behaviors that violate social norms and degrade a user’s perception of the robot’s socio-affective competence, such as interrupting at inappropriate times. • None: Neither a social error nor competence is observed. Answer the above from the following Images and Conversation History: {Interaction Transcript} Prompt Example 3...

  71. [77]

    You are given the Images and Conversation History between a social robotic agent (Jibo) and a participant

    Social Norms: Recognizing accepted behaviors and violations in social settings Answer the above from the following Images and Conversation History: {Interaction Transcript} Prompt Example 4: Multiple Social Attribute Presence (Wellness Dataset) The social robotic agent is designed to be a social positive psychology coach that delivers interactive positi...

  72. [78]

    Emotions: The ability to identify and interpret emotional expressions in oneself and others

  73. [79]

    Engagement: Observing and assessing levels of participation and interest

  74. [80]

    Conversational Mechanics: Understanding turn-taking, interruptions, and conversational flow

  75. [81]

    Knowledge State: Assessing what others know or believe in context

  76. [82]

    Intention: Inferring the goals or purposes behind others’ actions or speech

  77. [83]

    Social Relationships: Understanding interpersonal dynamics and their context

  78. [84]

    Respond with True if the behavior demonstrates more than one social attribute

    Social Norms: Recognizing accepted behaviors and violations in social settings Task: Based on the transcript, determine whether the agent’s behavior involves multiple social attributes. Respond with True if the behavior demonstrates more than one social attribute. Respond with False if the behavior is based on only a single attribute. Answer the above from ...

  79. [85]

    I’ll tell you about that next week

    Participant: Not yet. I’ll tell you about that next week. 2) Participant: Let’s see. Let’s see

  80. [86]

    Today I took a walk around the building that I work in

    Participant: Yes. Today I took a walk around the building that I work in. I took the stairs all the way down four floors, and then all the way back up so that I could recharge to get back to work
