Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning
Pith reviewed 2026-05-22 21:11 UTC · model grok-4.3
The pith
Foundation models exhibit substantial performance gaps in recognizing social deficits during human-robot interactions on the SHREC benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The SHREC Dataset is a benchmark of approximately 400 real-world human-robot interaction videos and over 10K annotations that capture robot social errors, competencies, underlying rationales, and corrections. It defines eight benchmark tasks targeting detection of social errors and competencies, identification of underlying social attributes, comprehension of interaction flow, and provision of rationale and alternative correct actions. Experiments with state-of-the-art foundation models reveal substantial performance gaps relative to human evaluators.
What carries the argument
The SHREC Dataset together with its eight benchmark tasks that measure social reasoning capabilities in human-robot interactions.
If this is right
- Foundation models require improvements in emotion understanding, intention tracking, and conversational mechanics for robot contexts.
- The dataset highlights social challenges unique to human-robot interactions that prior human-human datasets do not address.
- Directions emerge for developing socially intelligent AI by targeting the identified failure modes.
- The eight tasks provide concrete evaluation criteria for tracking progress in social reasoning.
Where Pith is reading between the lines
- Robots built with models trained to close these gaps could produce more natural daily interactions.
- Similar video-based benchmarks could be applied to other embodied settings such as assistive devices or autonomous systems.
- The performance gaps suggest that scaling alone may not suffice without explicit social-situation training data.
Load-bearing premise
The annotations and task definitions in the SHREC dataset accurately capture the social attributes, rationales, and corrections needed for real-world human-robot social reasoning.
What would settle it
A state-of-the-art foundation model achieving performance levels comparable to human evaluators across all eight tasks on the SHREC videos would falsify the reported substantial performance gaps.
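The settling condition above can be written as a simple per-task check. The sketch below is illustrative only: the task names, the scores, and the 0.02 tolerance are hypothetical placeholders, not values from the SHREC paper, and "comparable to human evaluators" is interpreted here as the human score minus the model score staying within the tolerance on every task.

```python
def gaps_closed(model_scores, human_scores, tolerance=0.02):
    """Return (all_closed, per_task_gap), where gap = human score - model score.

    `tolerance` is an assumed threshold for "comparable"; the paper does not define one.
    """
    if set(model_scores) != set(human_scores):
        raise ValueError("model and human scores must cover the same tasks")
    gaps = {task: human_scores[task] - model_scores[task] for task in human_scores}
    # The claim of substantial gaps is contradicted only if every task is within tolerance.
    return all(gap <= tolerance for gap in gaps.values()), gaps


# Hypothetical scores for eight placeholder tasks (not SHREC results).
human = {f"task_{i}": 0.90 for i in range(1, 9)}
model = {f"task_{i}": 0.89 for i in range(1, 9)}
model["task_3"] = 0.60  # one task with a large remaining gap

closed, gaps = gaps_closed(model, human)
print(closed)  # False: task_3 still shows a large gap
```

Requiring all eight tasks to pass, rather than an average, matches the "across all eight tasks" wording of the condition.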
Original abstract
Our work focuses on the social reasoning capabilities of foundation models for real-world human-robot interactions. We introduce the Social Human Robot Embodied Conversation (SHREC) Dataset, a benchmark of $\sim$400 real-world human-robot interaction videos and over 10K annotations, capturing robot social errors, competencies, underlying rationales, and corrections. Unlike prior datasets focused on human-human interactions, the SHREC Dataset uniquely highlights the social challenges faced by real-world social robots such as emotion understanding, intention tracking, and conversational mechanics. Moreover, current foundation models struggle to recognize these deficits, which manifest as subtle, socially situated failures. To evaluate AI models' capacity for social reasoning, we define eight benchmark tasks targeting critical areas such as (1) detection of social errors and competencies, (2) identification of underlying social attributes, (3) comprehension of interaction flow, and (4) providing rationale and alternative correct actions. Experiments with state-of-the-art foundation models, alongside human evaluations, reveal substantial performance gaps -- underscoring the difficulty and providing directions in developing socially intelligent AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the SHREC Dataset of ~400 real-world human-robot interaction videos and over 10K annotations capturing robot social errors, competencies, rationales, and corrections. It defines eight benchmark tasks targeting detection of social errors/competencies, identification of social attributes, comprehension of interaction flow, and provision of rationales/corrections. Experiments with state-of-the-art foundation models and human evaluations report substantial performance gaps, underscoring difficulties in social reasoning for embodied HRI.
Significance. If the annotations are shown to be reliable and the tasks validly isolate real social attributes in HRI, the work would supply a needed benchmark distinct from human-human datasets, offering concrete directions for improving foundation models on subtle, situated social failures such as emotion understanding and intention tracking.
major comments (1)
- §3 (Dataset Construction) and §4 (Benchmark Tasks): The manuscript supplies no quantitative details on the annotation protocol (number of annotators per item, training, adjudication) or inter-rater reliability metrics (Cohen/Fleiss kappa or equivalent) for the ~10K annotations. This directly undermines the central claim of model limitations, because observed gaps on the eight tasks could arise from noisy or subjective labels rather than genuine shortfalls in social reasoning.
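For readers unfamiliar with the reliability metric the referee asks for, here is a minimal sketch of Cohen's kappa for two annotators labeling the same items. The two-annotator setup and the example labels are assumptions for illustration; they are not SHREC annotations, and the paper's actual agreement procedure is not described in the reviewed text.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six segments (not taken from SHREC).
annotator_1 = ["error", "error", "none", "competence", "none", "error"]
annotator_2 = ["error", "none", "none", "competence", "none", "competence"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.52
```

Kappa corrects raw agreement (4 of 6 items here) for the agreement expected by chance, which is why it is preferred over plain percent agreement when label distributions are skewed. With more than two annotators per item, Fleiss' kappa would be the analogous statistic.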
minor comments (2)
- Abstract: The phrase 'substantial performance gaps' is stated without accompanying numerical results (e.g., accuracy or F1 differences between models and humans); adding these would strengthen the summary.
- §5 (Experiments): Clarify the exact prompting format and output parsing procedure used for each of the eight tasks to improve reproducibility.
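To make the reproducibility concern concrete, the sketch below shows one possible way free-text model responses could be mapped to task answers. It is an assumption-laden illustration, not the authors' procedure: the function names are hypothetical, and the label set is taken from the definitions quoted in the paper's appendix prompts (Social Competence, Social Error, None).

```python
import re

# Labels named in the appendix prompt definitions; the parsing rules are assumptions.
LABELS = ("Social Competence", "Social Error", "None")


def parse_boolean(response):
    """Return True or False if exactly one of the two words appears, else None."""
    found = {m.lower() for m in re.findall(r"\b(true|false)\b", response, flags=re.IGNORECASE)}
    if len(found) != 1:
        return None  # ambiguous or missing answer
    return found.pop() == "true"


def parse_label(response):
    """Return the single label mentioned in the response, else None."""
    matches = [
        label
        for label in LABELS
        if re.search(rf"\b{re.escape(label)}\b", response, flags=re.IGNORECASE)
    ]
    return matches[0] if len(matches) == 1 else None


print(parse_boolean("The answer is False."))
print(parse_label("This is a Social Error."))
```

Different choices here (for example, how ambiguous or verbose responses are scored) can shift reported accuracy, which is why the exact parsing rules matter for comparing models.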
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for noting the potential significance of the SHREC dataset. We address the major comment on annotation protocol and reliability below.
- Referee: §3 (Dataset Construction) and §4 (Benchmark Tasks): The manuscript supplies no quantitative details on the annotation protocol (number of annotators per item, training, adjudication) or inter-rater reliability metrics (Cohen/Fleiss kappa or equivalent) for the ~10K annotations. This directly undermines the central claim of model limitations, because observed gaps on the eight tasks could arise from noisy or subjective labels rather than genuine shortfalls in social reasoning.
  Authors: We agree that quantitative details on the annotation protocol and inter-rater reliability metrics are essential to establish label quality and support the validity of the benchmark tasks. The initial manuscript omitted these specifics. In the revised version, we will expand §3 to report the number of annotators per item, annotator training and adjudication procedures, and inter-rater reliability metrics (e.g., Fleiss' kappa) for the annotations. These additions will allow assessment of whether the observed model performance gaps reflect genuine social reasoning challenges.
  Revision: yes
Circularity Check
Empirical dataset and benchmark paper with no derivations or self-referential predictions
The paper introduces the SHREC dataset of ~400 HRI videos and >10K annotations, defines eight benchmark tasks, and reports model performance gaps. No equations, fitted parameters, or derivation chains appear in the provided text. The central claims rest on empirical annotation and evaluation rather than any self-definition, fitted-input-as-prediction, or self-citation load-bearing step. External model evaluations and human comparisons serve as independent benchmarks. This is the normal case of a self-contained empirical contribution; no circularity is exhibited.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The eight benchmark tasks accurately target and measure critical areas of social reasoning in human-robot interactions.
APPENDIX
We employed two independent annotators for every video segment, cove...
[69]
None Definitions: • Social Competence:The ability to successfully conduct social interactions, which depends on the awareness and identification of social-emotional cues, the ability to process such cues, and the ability to decide on and express a normative response. • Social Error:Behaviors that violate social norms and degrade a user’s perception of the...
-
[70]
• None:Neither a social error nor competence is observed
None Definitions: • Social Error:Behaviors that violate social norms and degrade a user’s perception of the robot’s socio-affective competence, such as interrupting at inappropriate times. • None:Neither a social error nor competence is observed. Answer the above from the following Images and Conversation History: {Interaction Transcript} Prompt Example 3...
- Social Norms: Recognizing accepted behaviors and violations in social settings
Answer the above from the following Images and Conversation History: {Interaction Transcript}
Prompt Example 4: Multiple Social Attribute Presence (Wellness Dataset)
The social robotic agent is designed to be a social positive psychology coach that delivers interactive positi...
- Emotions: The ability to identify and interpret emotional expressions in oneself and others
- Engagement: Observing and assessing levels of participation and interest
- Conversational Mechanics: Understanding turn-taking, interruptions, and conversational flow
- Knowledge State: Assessing what others know or believe in context
- Intention: Inferring the goals or purposes behind others’ actions or speech
- Social Relationships: Understanding interpersonal dynamics and their context
- Social Norms: Recognizing accepted behaviors and violations in social settings
Task: Based on the transcript, determine whether the agent’s behavior involves multiple social attributes. Respond with True if the behavior demonstrates more than one social attribute. Respond with False if the behavior is based on only a single attribute.
Answer the above from ...
1) Participant: Not yet. I’ll tell you about that next week.
2) Participant: Let’s see. Let’s see
3) Participant: Yes. Today I took a walk around the building that I work in. I took the stairs all the way down four floors, and then all the way back up so that I could recharge to get back to work