arxiv: 2506.05425 · v3 · submitted 2025-06-05 · 💻 cs.CV · cs.AI

SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Pith reviewed 2026-05-19 11:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video benchmark · multimodal large language models · social interaction · social relation theory · scene understanding · state reasoning · dynamics prediction

The pith

Multimodal language models grasp social scenes in videos but falter when inferring mental states or predicting behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SIV-Bench as a new video dataset to test how well multimodal large language models understand social interactions. It divides the task into three levels: recognizing visible elements in a scene, reasoning about hidden mental states and relationships, and forecasting what people will do next. Experiments across thousands of clips show models manage the first level reasonably well yet consistently struggle with the deeper reasoning steps. The main obstacle appears to be confusion when identifying which social relation applies between people. The benchmark draws on social relation theory and includes varied video styles and cultural contexts to expose these specific gaps.

Core claim

SIV-Bench supplies 2,792 video clips and 5,455 question-answer pairs that evaluate multimodal large language models on Social Scene Understanding, Social State Reasoning, and Social Dynamics Prediction. Models achieve better results on basic scene understanding yet remain weak on state reasoning and dynamics prediction, with systematic errors in relation inference emerging as the central limitation. Further examination links the shortfalls to misalignment with human reasoning patterns and insufficient depth in step-by-step inference, while audio and subtitles improve outcomes on the more demanding tasks.
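The three-level comparison above comes down to scoring each question and averaging per task. A minimal sketch of that tally, assuming multiple-choice items; the record fields `task`, `answer`, and `predicted` are hypothetical names chosen for illustration and are not taken from the SIV-Bench release:

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy} for a list of multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      "task"      -> "SSU", "SSR", or "SDP"
      "answer"    -> ground-truth option label
      "predicted" -> option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        if rec["predicted"] == rec["answer"]:
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records invented for illustration; they are not SIV-Bench data.
records = [
    {"task": "SSU", "answer": "A", "predicted": "A"},
    {"task": "SSU", "answer": "B", "predicted": "B"},
    {"task": "SSR", "answer": "C", "predicted": "A"},
    {"task": "SSR", "answer": "D", "predicted": "D"},
    {"task": "SDP", "answer": "B", "predicted": "C"},
]
scores = accuracy_by_task(records)
print(scores)
```

Reporting the three numbers separately, rather than one pooled accuracy, is what lets a gap between scene understanding and the two reasoning tasks show up at all.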

What carries the argument

SIV-Bench, a collection of originally sourced video clips paired with questions generated through a human-LLM pipeline and organized around social relation theory to measure three progressive capabilities: scene understanding, state reasoning, and dynamics prediction.

If this is right

  • Audio and subtitles supply helpful signals specifically for the harder reasoning and prediction tasks.
  • Confusion over which relationship holds between people blocks progress on both state reasoning and behavior prediction.
  • Model outputs often diverge from the sequence of inferences humans follow when watching the same clips.
  • Performance gaps persist across different video lengths, genres, and linguistic backgrounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models may need explicit modules for tracking and updating social relations over time rather than inferring them anew each frame.
  • The same evaluation approach could be applied to live video streams to test whether the observed weaknesses appear in real-time settings.
  • If relation inference improves, gains should appear first in the prediction tasks that depend on accurate relationship models.

Load-bearing premise

The question-answer pairs created by the human-LLM pipeline accurately represent ordinary human judgments about social relations, mental states, and behavior predictions.

What would settle it

Collect fresh human ratings on a random subset of the videos and compare them directly to the existing ground-truth answers; large mismatches would indicate the benchmark questions do not track typical human social judgments.
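One way such a comparison might be quantified is with raw agreement plus Cohen's kappa, which discounts agreement expected by chance. The sketch below is an illustration under stated assumptions, not a procedure from the paper: it assumes one fresh human answer per item (for example, a majority vote) and uses invented option labels.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which the two label lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa between two equally long lists of categorical labels."""
    n = len(a)
    p_observed = raw_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the two marginal frequencies per label.
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:  # both lists use a single identical label
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Invented labels for six items; not drawn from the benchmark.
benchmark_answers = ["A", "B", "C", "A", "D", "B"]
fresh_human_answers = ["A", "B", "A", "A", "D", "C"]

agreement = raw_agreement(benchmark_answers, fresh_human_answers)
kappa = cohens_kappa(benchmark_answers, fresh_human_answers)
print(f"agreement={agreement:.3f} kappa={kappa:.3f}")
```

Breaking the same statistics out per task would show whether any mismatch concentrates in the reasoning-heavy items, which is where the load-bearing premise is most exposed.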

Figures

Figures reproduced from arXiv: 2506.05425 by Fanqi Kong, Song-Chun Zhu, Weiqin Zu, Xinyu Chen, Xue Feng, Yaodong Yang.

Figure 1. Overview of SIV-Bench, showing its diverse videos spanning various social interactions and …
Figure 2. The SIV-Bench construction pipeline, detailing the data collection process (left), and the …
Figure 3. Video statistics for SIV-Bench: (a) Distribution of social relation types. (b) Distribution of …
Figure 4. Illustration of the three subtitle conditions applied in SIV-Bench: ‘Origin’ (original video with existing on-screen text), ‘-Subtitle’ (original text removed), and ‘+Subtitle’ (transcribed and translated dialogue added). While most fall in the 10–20 second range, the dataset spans a wide distribution, including many short clips (under 10 seconds) and a significant number over 60 seconds. Linguistically…
Figure 5. Detailed statistics of Question-Answer pairs in SIV-Bench, showing the distribution across the 10 fine-grained sub-tasks. Subset 2: We recruit a team of 20 human annotators to verify the non-consensus items. Each QA-pair is reviewed by at least two people. We only retain questions for which all reviewing annotators independently and unanimously select the same answer option, which is then established as …
Figure 6. Radar chart illustrating the comparative …
Figure 7. Word clouds of GPT-generated keywords used for sourcing videos across 14 distinct social …
Figure 8. Examples illustrating the diverse video genres in SIV-Bench, including (from left to right) …
Figure 9. Examples of diverse video presentation styles featured in SIV-Bench, including (from left …
Figure 10. The prompt used to guide Gemini for the initial generation of question-answer pairs.
Figure 12. Prompt for the video-agnostic filtering of Question-Answer pairs.
Figure 13. Prompt used to score the difficulty of SIV-Bench QA pairs on a 1-to-5 scale, based …
Figure 14. Prompt used to instruct LLM for the final classification of Question-Answer pairs into one …
Figure 15. Screenshot of the guidelines provided to human annotators, detailing the tasks of answering …
Figure 16. Example of the web-based interface used by human annotators for watching videos, …
Figure 17. Statistical analysis of SIV-Bench Question-Answer (QA) pairs. (a) average word count …
Figure 18. Standardized prompt templates used for evaluating MLLMs on SIV-Bench. Separate …
Figure 19. Examples of failure cases in Social Scene Understanding (SSU) tasks, including errors in …
Figure 20. Examples of failure cases in Social State Reasoning (SSR) tasks, highlighting difficulties …
Figure 21. Examples of failure cases in Social Dynamics Prediction (SDP) tasks, covering both …
Original abstract

Understanding social interaction, which encompasses perceiving numerous and subtle multimodal cues, inferring unobservable mental states and relations, and dynamically predicting others' behavior, is the foundation for achieving human-machine interaction. Despite rapid advances in Multimodal Large Language Models (MLLMs), the rich and multifaceted nature of social interaction has hindered the development of benchmarks that holistically evaluate and guide their social interaction abilities. Based on social relation theory, which has been widely regarded as a foundational framework for understanding social behavior, we provide SIV-Bench, a novel video benchmark for systematically evaluating MLLMs' capabilities across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 originally collected video clips and 5,455 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It covers 14 typical relationships, diverse video lengths, genres, presentation styles, and linguistic and cultural backgrounds. Our comprehensive experiments show that leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with the systematic confusion in relation inference as a key bottleneck. An in-depth analysis of the reasoning process attributes MLLMs' suboptimal performance to misalignment with human thoughts and insufficient reasoning depth. Moreover, we find audio and subtitles aid in reasoning-intensive SSR and SDP. Together, SIV-Bench offers a unified testbed to measure progress, expose limitations, and guide future research toward more socially intelligent MLLMs. We release the dataset and code at our project website: https://kfq20.github.io/sivbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces SIV-Bench, a video benchmark with 2,792 originally collected clips and 5,455 QA pairs generated via a human-LLM collaborative pipeline. Grounded in social relation theory, it evaluates MLLMs across three tasks: Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP), covering 14 relationships and diverse video characteristics. Experiments indicate leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with systematic confusion in relation inference as a key bottleneck; audio and subtitles are shown to aid reasoning-intensive tasks, and analysis attributes suboptimal performance to misalignment with human thoughts and insufficient reasoning depth. The dataset and code are released publicly.

Significance. If the QA pairs reliably instantiate social relation theory and match human judgments, the benchmark would provide a valuable unified testbed for measuring progress in MLLM social intelligence, exposing specific gaps in reasoning and relation inference while highlighting the utility of multimodal cues like audio. The public release of data and code supports reproducibility and further work toward socially aware models.

major comments (1)
  1. Abstract: The headline finding that MLLMs are relatively strong on SSU yet weak on SSR/SDP (with relation inference as bottleneck) rests on the assumption that the human-LLM collaborative pipeline produces QA pairs that faithfully instantiate social relation theory and match human judgments. The abstract describes the pipeline and coverage of 14 relationships but supplies no inter-annotator agreement, expert adjudication rate, or comparison against purely human annotations for the 2,792 clips. Without these, systematic model errors could partly reflect annotation artifacts rather than genuine capability gaps, especially on reasoning-heavy SSR and SDP items.
minor comments (1)
  1. The manuscript would benefit from a table or figure summarizing the distribution of the 5,455 QA pairs across SSU, SSR, and SDP categories to clarify task balance and coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comment raises a valid point about strengthening the evidence for the quality and human alignment of our QA pairs. We address this below and will incorporate revisions to improve the paper.

point-by-point responses
  1. Referee: Abstract: The headline finding that MLLMs are relatively strong on SSU yet weak on SSR/SDP (with relation inference as bottleneck) rests on the assumption that the human-LLM collaborative pipeline produces QA pairs that faithfully instantiate social relation theory and match human judgments. The abstract describes the pipeline and coverage of 14 relationships but supplies no inter-annotator agreement, expert adjudication rate, or comparison against purely human annotations for the 2,792 clips. Without these, systematic model errors could partly reflect annotation artifacts rather than genuine capability gaps, especially on reasoning-heavy SSR and SDP items.

    Authors: We agree that explicit quantitative validation of the human-LLM pipeline is important to support our claims about model performance gaps. The full manuscript already details the multi-stage human oversight process (including expert review and correction of LLM-generated QA pairs), but we acknowledge that inter-annotator agreement metrics, adjudication rates, and direct human-only comparisons were not reported. In the revised manuscript we will add a dedicated quality validation subsection reporting: (1) inter-annotator agreement on a sampled subset of QA pairs, (2) expert adjudication statistics, and (3) consistency results from an independent human re-annotation study on a portion of the clips. We will also update the abstract to reference these validation steps. These additions will directly address the concern that annotation artifacts might explain observed model weaknesses. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark rests on external theory and new data collection

full rationale

The paper constructs SIV-Bench by collecting 2,792 new video clips and generating 5,455 QA pairs via a human-LLM pipeline explicitly grounded in external social relation theory. No equations, fitted parameters, or predictions appear in the provided text. Core claims about MLLM performance on SSU/SSR/SDP are empirical results measured against this independently created benchmark rather than any reduction to self-referential inputs or self-citation chains. The derivation chain is self-contained against external benchmarks and theory with no load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that social relation theory adequately structures social interaction and that the human-LLM pipeline yields faithful ground-truth labels; no free parameters or new invented entities are introduced.

axioms (1)
  • domain assumption: Social relation theory is a foundational framework for understanding social behavior
    The benchmark is explicitly constructed around this theory as stated in the abstract.

pith-pipeline@v0.9.0 · 5837 in / 1296 out tokens · 76123 ms · 2026-05-19T11:32:45.286548+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    SIV-Bench is fundamentally organized around social relationships, recognizing their critical role in shaping social interaction [47, 5, 16]. Specifically, SIV-Bench is built on Fiske’s Relational Models Theory [12], categorizing social interactions via four foundational models (Communal Sharing, Authority Ranking, Equality Matching, and Market Pricing)

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our comprehensive experiments show that leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with the systematic confusion in relation inference as a key bottleneck.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

    cs.CV 2026-05 unverdicted novelty 7.0

    GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.

  2. SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 2 Pith papers · 16 internal anchors

  1. [1]

    I. M. Alabdulmohsin, B. Neyshabur, and X. Zhai. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022

  2. [2]

    A. Amiri-Margavi, I. Jebellat, E. Jebellat, and S. P. M. Davoudi. Enhancing answer reliability through inter-model consensus of large language models. arXiv preprint arXiv:2411.16797, 2024

  3. [3]

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    J. Berger, B. P. Cohen, and M. Zelditch Jr. Status characteristics and social interaction. American sociological review, pages 241–255, 1972

  5. [5]

    I. Burkitt. Social relationships and emotions. Sociology, 31(1):37–55, 1997

  6. [6]

    R. M. Byrne. Counterfactual thought. Annual review of psychology, 67(1):135–157, 2016

  7. [7]

    H. Chen, W. Ji, L. Xu, and S. Zhao. Multi-agent consensus seeking via large language models. arXiv preprint arXiv:2310.20151, 2023

  8. [8]

    J. C.-Y. Chen, S. Saha, and M. Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023

  9. [9]

    T. Doshi. Build rich, interactive web apps with an updated gemini 2.5 pro. https://blog.google/products/gemini/gemini-2-5-pro-updates/, May 2025. Accessed: 2025-05-09

  10. [10]

    H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  11. [11]

    X. Fang, K. Mao, H. Duan, X. Zhao, Y. Li, D. Lin, and K. Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024

  12. [12]

    A. P. Fiske. The four elementary forms of sociality: framework for a unified theory of social relations. Psychological review, 99(4):689, 1992

  13. [13]

    C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024

  14. [14]

    Start building with Gemini 2.5 Flash

    Google. Start building with Gemini 2.5 Flash. Google for Developers Blog, Apr. 2025. URL https://developers.googleblog.com/en/start-building-with-gemini-25-flash/. Accessed: May 9, 2025

  15. [15]

    Gemini 2.0: Flash, flash-lite and pro

    Google. Gemini 2.0: Flash, flash-lite and pro. https://developers.googleblog.com/en/gemini-2-family-expands/, February 2025. Accessed: 2025-05-09

  16. [16]

    W. W. Hartup. Social relationships and their developmental significance. American psychologist, 44(2):120, 1989

  17. [17]

    L. Hong, Z. Liu, W. Chen, C. Tan, Y. Feng, X. Zhou, P. Guo, J. Li, Z. Chen, S. Gao, et al. Lvos: A benchmark for large-scale long-term video object segmentation. arXiv preprint arXiv:2404.19326, 2024

  18. [18]

    G. Hou, W. Zhang, Y. Shen, Z. Tan, S. Shen, and W. Lu. Entering real social world! benchmarking the theory of mind and socialization capabilities of llms from a first-person perspective. arXiv preprint arXiv:2410.06195, 2024

  19. [19]

    Y. Huang, X. Wang, H. Liu, F. Kong, A. Qin, M. Tang, X. Wang, S.-C. Zhu, M. Bi, S. Qi, et al. Adasociety: An adaptive environment with social structures for multi-agent decision-making. arXiv preprint arXiv:2411.03865, 2024

  20. [20]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  21. [21]

    Y. Jin, M. Choi, G. Verma, J. Wang, and S. Kumar. Mm-soc: Benchmarking multimodal large language models in social media platforms. arXiv preprint arXiv:2402.14154, 2024

  22. [22]

    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  23. [23]

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  24. [24]

    K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  25. [25]

    Y. Li, X. Chen, B. Hu, L. Wang, H. Shi, and M. Zhang. Videovista: A versatile benchmark for video understanding and reasoning. arXiv preprint arXiv:2406.11303, 2024

  26. [26]

    X. Liu, W. Liu, M. Zhang, J. Chen, L. Gao, C. Yan, and T. Mei. Social relation recognition from videos via multi-scale spatial-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3566–3574, 2019

  27. [27]

    Y. Liu, S. Li, Y. Liu, Y. Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024

  28. [28]

    K. Mangalam, R. Akshulakov, and J. Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023

  29. [29]

    L. Mathur, M. Qian, P. P. Liang, and L.-P. Morency. Social genome: Grounded social reasoning abilities of multimodal models. arXiv preprint arXiv:2502.15109, 2025

  30. [30]

    D. W. Maynard and A. Peräkylä. Language and social interaction. In Handbook of social psychology, pages 233–257. Springer, 2003

  31. [31]

    A. Mollahosseini, B. Hasani, and M. H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1): 18–31, 2017

  32. [32]

    M. Ning, B. Zhu, Y. Xie, B. Lin, J. Cui, L. Yuan, D. Chen, and L. Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023

  33. [33]

    OpenAI o3 and o4-mini System Card

    OpenAI. OpenAI o3 and o4-mini System Card. Technical report, OpenAI, April 2025. Accessed: 2025-05-09. The direct PDF can also be found at https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf

  34. [34]

    V. Patraucean, L. Smaira, A. Gupta, A. Recasens, L. Markeeva, D. Banarse, S. Koppula, M. Malinowski, Y. Yang, C. Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023

  35. [35]

    MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations

    S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018

  36. [36]

    J. Qi, J. Yu, T. Tu, K. Gao, Y. Xu, X. Guan, X. Wang, B. Xu, L. Hou, J. Li, et al. Goal: A challenging knowledge-grounded video captioning benchmark for real-time soccer commentary generation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 5391–5395, 2023

  37. [37]

    R. Qiang, Y. Zhuang, Y. Li, D. S. V. K, R. Zhang, C. Li, I. S.-H. Wong, S. Yang, P. Liang, C. Zhang, and B. Dai. Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering, 2025. URL https://arxiv.org/abs/2505.07782

  38. [38]

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356

  39. [39]

    N. Ramnani and R. C. Miall. A system in the human brain for predicting the actions of others. Nature neuroscience, 7(1):85–90, 2004

  40. [40]

    Cinepile: A long video question answering dataset and benchmark

    R. Rawal, K. Saifullah, M. Farré, R. Basri, D. Jacobs, G. Somepalli, and T. Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813, 2024

  41. [41]

    K. Shinoda, N. Hojo, K. Nishida, S. Mizuno, K. Suzuki, R. Masumura, H. Sugiyama, and K. Saito. Tomato: Verbalizing the mental states of role-playing llms for benchmarking theory of mind. arXiv preprint arXiv:2501.08838, 2025

  42. [42]

    L. Smith-Lovin and D. R. Heise. Analyzing social interaction. Advances in affect control theory. Gordon and Breach, 1988

  43. [43]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  44. [44]

    J. W. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, et al. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285–1295, 2024

  45. [45]

    Q. Sun, B. Schiele, and M. Fritz. A domain based approach to social relation recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3481–3490, 2017

  46. [46]

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  47. [47]

    J. W. Thibaut. The social psychology of groups. Routledge, 2017

  48. [48]

    H. C. Triandis. The self and social behavior in differing cultural contexts. Psychological review, 96(3):506, 1989

  49. [49]

    R. Wang, H. Yu, W. Zhang, Z. Qi, M. Sap, G. Neubig, Y. Bisk, and H. Zhu. Sotopia-pi: Interactive learning of socially intelligent language agents. arXiv preprint arXiv:2403.08715, 2024

  50. [50]

    W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, X. Gu, S. Huang, B. Xu, Y. Dong, et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024

  51. [51]

    A. Wilf, L. Mathur, S. Mathew, C. Ko, Y. Kebe, P. P. Liang, and L.-P. Morency. Social-iq 2.0 challenge: Benchmarking multimodal social understanding. https://github.com/abwilf/Social-IQ-2.0-Challenge, 2023

  52. [52]

    H. Wu, X. Liu, C. C. Hagan, and D. Mobbs. Mentalizing during social interaction: A four component model. Cortex, 126:242–252, 2020

  53. [53]

    S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. Next-gpt: Any-to-any multimodal llm. In Forty-first International Conference on Machine Learning, 2024

  54. [54]

    N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018

  55. [55]

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  56. [56]

    video-subtitle-remover

    YaoFANGUK. video-subtitle-remover. https://github.com/YaoFANGUK/video-subtitle-remover, 2025. Accessed: 2025-05-09

  57. [57]

    J. Ye, H. Xu, H. Liu, A. Hu, M. Yan, Q. Qian, J. Zhang, F. Huang, and J. Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024

  58. [58]

    D. Yu, K. Sun, C. Cardie, and D. Yu. Dialogue-based relation extraction. arXiv preprint arXiv:2004.08056, 2020

  59. [59]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li. Video instruction tuning with synthetic data, 2024. URL https://arxiv.org/abs/2410.02713

  60. [60]

    H. Zhao, A. Torralba, L. Torresani, and Z. Yan. Hacs: Human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8668–8678, 2019

  61. [61]

    J. Zhou, Y. Shu, B. Zhao, B. Wu, S. Xiao, X. Yang, Y. Xiong, B. Zhang, T. Huang, and Z. Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

  62. [62]

    X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L.-P. Morency, Y. Bisk, D. Fried, G. Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents. arXiv preprint arXiv:2310.11667, 2023

  63. [64]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...