SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Fanqi Kong; Song-Chun Zhu; Weiqin Zu; Xinyu Chen; Xue Feng; Yaodong Yang

arXiv: 2506.05425 · v3 · submitted 2025-06-05 · cs.CV · cs.AI

Pith reviewed 2026-05-19 11:32 UTC · model grok-4.3

keywords: video benchmark · multimodal large language models · social interaction · social relation theory · scene understanding · state reasoning · dynamics prediction

The pith

Multimodal language models grasp social scenes in videos but falter when inferring mental states or predicting behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SIV-Bench as a new video dataset to test how well multimodal large language models understand social interactions. It divides the task into three levels: recognizing visible elements in a scene, reasoning about hidden mental states and relationships, and forecasting what people will do next. Experiments across thousands of clips show models manage the first level reasonably well yet consistently struggle with the deeper reasoning steps. The main obstacle appears to be confusion when identifying which social relation applies between people. The benchmark draws on social relation theory and includes varied video styles and cultural contexts to expose these specific gaps.

Core claim

SIV-Bench supplies 2,792 video clips and 5,455 question-answer pairs that evaluate multimodal large language models on Social Scene Understanding, Social State Reasoning, and Social Dynamics Prediction. Models achieve better results on basic scene understanding yet remain weak on state reasoning and dynamics prediction, with systematic errors in relation inference emerging as the central limitation. Further examination links the shortfalls to misalignment with human reasoning patterns and insufficient depth in step-by-step inference, while audio and subtitles improve outcomes on the more demanding tasks.

What carries the argument

SIV-Bench, a collection of originally sourced video clips paired with questions generated through a human-LLM pipeline and organized around social relation theory to measure three progressive capabilities: scene understanding, state reasoning, and dynamics prediction.

If this is right

Audio and subtitles supply helpful signals specifically for the harder reasoning and prediction tasks.
Confusion over which relationship holds between people blocks progress on both state reasoning and behavior prediction.
Model outputs often diverge from the sequence of inferences humans follow when watching the same clips.
Performance gaps persist across different video lengths, genres, and linguistic backgrounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future models may need explicit modules for tracking and updating social relations over time rather than inferring them anew each frame.
The same evaluation approach could be applied to live video streams to test whether the observed weaknesses appear in real-time settings.
If relation inference improves, gains should appear first in the prediction tasks that depend on accurate relationship models.

Load-bearing premise

The question-answer pairs created by the human-LLM pipeline accurately represent ordinary human judgments about social relations, mental states, and behavior predictions.

What would settle it

Collect fresh human ratings on a random subset of the videos and compare them directly to the existing ground-truth answers; large mismatches would indicate the benchmark questions do not track typical human social judgments.
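
The check described above can be sketched in a few lines. This is a minimal illustration, not the authors' validation procedure: the record keys (`task`, `ground_truth`, `human_votes`), the majority-vote rule, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """Return the most common option, or None if there is no clear winner."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie for first place: no majority answer
    return counts[0][0]


def agreement_by_task(items):
    """Fraction of items per task whose human majority matches the ground truth.

    `items` is an iterable of dicts with the hypothetical keys
    'task', 'ground_truth', and 'human_votes'.
    """
    matched = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["task"]] += 1
        if majority_vote(item["human_votes"]) == item["ground_truth"]:
            matched[item["task"]] += 1
    return {task: matched[task] / total[task] for task in total}


# Made-up ratings, for illustration only.
sample = [
    {"task": "SSU", "ground_truth": "A", "human_votes": ["A", "A", "B"]},
    {"task": "SSR", "ground_truth": "C", "human_votes": ["B", "B", "C"]},
    {"task": "SSR", "ground_truth": "D", "human_votes": ["D", "D", "D"]},
    {"task": "SDP", "ground_truth": "B", "human_votes": ["A", "B"]},
]
rates = agreement_by_task(sample)
print(rates)
```

Splitting the rate by task matters here because the concern is specific to the reasoning-heavy SSR and SDP items; a single pooled number could hide a low match rate on exactly those tasks.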

Figures

Figures reproduced from arXiv: 2506.05425 by Fanqi Kong, Song-Chun Zhu, Weiqin Zu, Xinyu Chen, Xue Feng, Yaodong Yang.

**Figure 2.** The SIV-Bench construction pipeline, detailing the data collection process (left), and the …

**Figure 3.** Video statistics for SIV-Bench: (a) Distribution of social relation types. (b) Distribution of …

**Figure 4.** Illustration of the three subtitle conditions applied in SIV-Bench: ‘Origin’ (original video with existing onscreen text), ‘-Subtitle’ (original text removed), and ‘+Subtitle’ (transcribed and translated dialogue added).
(Adjacent body text from the paper: While most fall in the 10–20 second range, the dataset spans a wide distribution, including many short clips (under 10 seconds) and a significant number over 60 seconds. Linguistically…)

**Figure 5.** Detailed statistics of Question-Answer pairs in SIV-Bench, showing the distribution across the 10 fine-grained sub-tasks.
(Adjacent body text from the paper: Subset 2: We recruit a team of 20 human annotators to verify the non-consensus items. Each QA-pair is reviewed by at least two people. We only retain questions for which all reviewing annotators independently and unanimously select the same answer option, which is then established as …)

**Figure 6.** Radar chart illustrating the comparative …

**Figure 7.** Word clouds of GPT-generated keywords used for sourcing videos across 14 distinct social …

**Figure 8.** Examples illustrating the diverse video genres in SIV-Bench, including (from left to right) …

**Figure 9.** Examples of diverse video presentation styles featured in SIV-Bench, including (from left …

**Figure 10.** The prompt used to guide Gemini for the initial generation of question-answer pairs.

**Figure 12.** Prompt for the video-agnostic filtering of Question-Answer pairs.

**Figure 13.** Prompt used to score the difficulty of SIV-Bench QA pairs on a 1-to-5 scale, based …

**Figure 14.** Prompt used to instruct LLM for the final classification of Question-Answer pairs into one …

**Figure 15.** Screenshot of the guidelines provided to human annotators, detailing the tasks of answering …

**Figure 16.** Example of the web-based interface used by human annotators for watching videos, …

**Figure 17.** Statistical analysis of SIV-Bench Question-Answer (QA) pairs. (a) average word count …

**Figure 18.** Standardized prompt templates used for evaluating MLLMs on SIV-Bench. Separate …

**Figure 19.** Examples of failure cases in Social Scene Understanding (SSU) tasks, including errors in …

**Figure 20.** Examples of failure cases in Social State Reasoning (SSR) tasks, highlighting difficulties …

**Figure 21.** Examples of failure cases in Social Dynamics Prediction (SDP) tasks, covering both …

Original abstract

Understanding social interaction, which encompasses perceiving numerous and subtle multimodal cues, inferring unobservable mental states and relations, and dynamically predicting others' behavior, is the foundation for achieving human-machine interaction. Despite rapid advances in Multimodal Large Language Models (MLLMs), the rich and multifaceted nature of social interaction has hindered the development of benchmarks that holistically evaluate and guide their social interaction abilities. Based on social relation theory, which has been widely regarded as a foundational framework for understanding social behavior, we provide SIV-Bench, a novel video benchmark for systematically evaluating MLLMs' capabilities across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 originally collected video clips and 5,455 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It covers 14 typical relationships, diverse video lengths, genres, presentation styles, and linguistic and cultural backgrounds. Our comprehensive experiments show that leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with the systematic confusion in relation inference as a key bottleneck. An in-depth analysis of the reasoning process attributes MLLMs' suboptimal performance to misalignment with human thoughts and insufficient reasoning depth. Moreover, we find audio and subtitles aid in reasoning-intensive SSR and SDP. Together, SIV-Bench offers a unified testbed to measure progress, expose limitations, and guide future research toward more socially intelligent MLLMs. We release the dataset and code at our project website: https://kfq20.github.io/sivbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SIV-Bench brings a new video dataset and task split for social interaction in MLLMs, but the QA pairs lack reported quality checks that would strengthen the performance gap claims.

The main thing here is a new benchmark with 2,792 original video clips and 5,455 QA pairs built around social relation theory. It splits evaluation into Social Scene Understanding, Social State Reasoning, and Social Dynamics Prediction, and the experiments show models handle basic scene tasks better than the reasoning and prediction ones, with relation inference as a repeated weak point. Audio and subtitles appear to help on the harder items.

Referee Report

1 major / 1 minor

Summary. The paper introduces SIV-Bench, a video benchmark with 2,792 originally collected clips and 5,455 QA pairs generated via a human-LLM collaborative pipeline. Grounded in social relation theory, it evaluates MLLMs across three tasks: Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP), covering 14 relationships and diverse video characteristics. Experiments indicate leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with systematic confusion in relation inference as a key bottleneck; audio and subtitles are shown to aid reasoning-intensive tasks, and analysis attributes suboptimal performance to misalignment with human thoughts and insufficient reasoning depth. The dataset and code are released publicly.

Significance. If the QA pairs reliably instantiate social relation theory and match human judgments, the benchmark would provide a valuable unified testbed for measuring progress in MLLM social intelligence, exposing specific gaps in reasoning and relation inference while highlighting the utility of multimodal cues like audio. The public release of data and code supports reproducibility and further work toward socially aware models.

major comments (1)

Abstract: The headline finding that MLLMs are relatively strong on SSU yet weak on SSR/SDP (with relation inference as bottleneck) rests on the assumption that the human-LLM collaborative pipeline produces QA pairs that faithfully instantiate social relation theory and match human judgments. The abstract describes the pipeline and coverage of 14 relationships but supplies no inter-annotator agreement, expert adjudication rate, or comparison against purely human annotations for the 2,792 clips. Without these, systematic model errors could partly reflect annotation artifacts rather than genuine capability gaps, especially on reasoning-heavy SSR and SDP items.

minor comments (1)

The manuscript would benefit from a table or figure summarizing the distribution of the 5,455 QA pairs across SSU, SSR, and SDP categories to clarify task balance and coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comment raises a valid point about strengthening the evidence for the quality and human alignment of our QA pairs. We address this below and will incorporate revisions to improve the paper.

Referee: Abstract: The headline finding that MLLMs are relatively strong on SSU yet weak on SSR/SDP (with relation inference as bottleneck) rests on the assumption that the human-LLM collaborative pipeline produces QA pairs that faithfully instantiate social relation theory and match human judgments. The abstract describes the pipeline and coverage of 14 relationships but supplies no inter-annotator agreement, expert adjudication rate, or comparison against purely human annotations for the 2,792 clips. Without these, systematic model errors could partly reflect annotation artifacts rather than genuine capability gaps, especially on reasoning-heavy SSR and SDP items.

Authors: We agree that explicit quantitative validation of the human-LLM pipeline is important to support our claims about model performance gaps. The full manuscript already details the multi-stage human oversight process (including expert review and correction of LLM-generated QA pairs), but we acknowledge that inter-annotator agreement metrics, adjudication rates, and direct human-only comparisons were not reported. In the revised manuscript we will add a dedicated quality validation subsection reporting: (1) inter-annotator agreement on a sampled subset of QA pairs, (2) expert adjudication statistics, and (3) consistency results from an independent human re-annotation study on a portion of the clips. We will also update the abstract to reference these validation steps. These additions will directly address the concern that annotation artifacts might explain observed model weaknesses.

Revision: yes
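
For readers unfamiliar with the inter-annotator agreement metrics discussed in this exchange, the following is a minimal sketch of Cohen's kappa for two annotators. The choice of kappa is an assumption made for illustration: neither the referee nor the authors name a specific statistic, and the labels below are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical answer choices from two annotators on four QA pairs.
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "B", "B", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy case the annotators agree on three of four items (0.75 observed), but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate. With more than two annotators per item, a multi-rater statistic would be needed instead.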

Circularity Check

0 steps flagged

No circularity: benchmark rests on external theory and new data collection

The paper constructs SIV-Bench by collecting 2,792 new video clips and generating 5,455 QA pairs via a human-LLM pipeline explicitly grounded in external social relation theory. No equations, fitted parameters, or predictions appear in the provided text. Core claims about MLLM performance on SSU/SSR/SDP are empirical results measured against this independently created benchmark rather than any reduction to self-referential inputs or self-citation chains. The derivation chain is self-contained against external benchmarks and theory with no load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that social relation theory adequately structures social interaction and that the human-LLM pipeline yields faithful ground-truth labels; no free parameters or new invented entities are introduced.

axioms (1)

Domain assumption: Social relation theory is a foundational framework for understanding social behavior.
The benchmark is explicitly constructed around this theory as stated in the abstract.

pith-pipeline@v0.9.0 · 5837 in / 1296 out tokens · 76123 ms · 2026-05-19T11:32:45.286548+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: SIV-Bench is fundamentally organized around social relationships, recognizing their critical role in shaping social interaction [47, 5, 16]. Specifically, SIV-Bench is built on Fiske’s Relational Models Theory [12], categorizing social interactions via four foundational models (Communal Sharing, Authority Ranking, Equality Matching, and Market Pricing)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: Our comprehensive experiments show that leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with the systematic confusion in relation inference as a key bottleneck.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
cs.CV · 2026-05 · unverdicted · novelty 7.0

GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
cs.CV · 2026-05 · unverdicted · novelty 6.0

SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 2 Pith papers · 16 internal anchors

[1] I. M. Alabdulmohsin, B. Neyshabur, and X. Zhai. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022
[2] A. Amiri-Margavi, I. Jebellat, E. Jebellat, and S. P. M. Davoudi. Enhancing answer reliability through inter-model consensus of large language models. arXiv preprint arXiv:2411.16797, 2024
[3] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
[4] J. Berger, B. P. Cohen, and M. Zelditch Jr. Status characteristics and social interaction. American sociological review, pages 241–255, 1972
[5] I. Burkitt. Social relationships and emotions. Sociology, 31(1):37–55, 1997
[6] R. M. Byrne. Counterfactual thought. Annual review of psychology, 67(1):135–157, 2016
[7] H. Chen, W. Ji, L. Xu, and S. Zhao. Multi-agent consensus seeking via large language models. arXiv preprint arXiv:2310.20151, 2023
[8] J. C.-Y. Chen, S. Saha, and M. Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023
[9] T. Doshi. Build rich, interactive web apps with an updated gemini 2.5 pro. https://blog.google/products/gemini/gemini-2-5-pro-updates/, May 2025. Accessed: 2025-05-09
[10] H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024
[11] X. Fang, K. Mao, H. Duan, X. Zhao, Y. Li, D. Lin, and K. Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024
[12] A. P. Fiske. The four elementary forms of sociality: framework for a unified theory of social relations. Psychological review, 99(4):689, 1992
[13] C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024
[14] Google. Start building with Gemini 2.5 Flash. Google for Developers Blog, Apr. 2025. URL https://developers.googleblog.com/en/start-building-with-gemini-25-flash/. Accessed: May 9, 2025
[15] Google. Gemini 2.0: Flash, flash-lite and pro. https://developers.googleblog.com/en/gemini-2-family-expands/, February 2025. Accessed: 2025-05-09
[16] W. W. Hartup. Social relationships and their developmental significance. American psychologist, 44(2):120, 1989
[17] L. Hong, Z. Liu, W. Chen, C. Tan, Y. Feng, X. Zhou, P. Guo, J. Li, Z. Chen, S. Gao, et al. Lvos: A benchmark for large-scale long-term video object segmentation. arXiv preprint arXiv:2404.19326, 2024
[18] G. Hou, W. Zhang, Y. Shen, Z. Tan, S. Shen, and W. Lu. Entering real social world! benchmarking the theory of mind and socialization capabilities of llms from a first-person perspective. arXiv preprint arXiv:2410.06195, 2024
[19] Y. Huang, X. Wang, H. Liu, F. Kong, A. Qin, M. Tang, X. Wang, S.-C. Zhu, M. Bi, S. Qi, et al. Adasociety: An adaptive environment with social structures for multi-agent decision-making. arXiv preprint arXiv:2411.03865, 2024
[20] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
[21] Y. Jin, M. Choi, G. Verma, J. Wang, and S. Kumar. Mm-soc: Benchmarking multimodal large language models in social media platforms. arXiv preprint arXiv:2402.14154, 2024
[22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017
[23] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
[24] K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024
[25] Y. Li, X. Chen, B. Hu, L. Wang, H. Shi, and M. Zhang. Videovista: A versatile benchmark for video understanding and reasoning. arXiv preprint arXiv:2406.11303, 2024
[26] X. Liu, W. Liu, M. Zhang, J. Chen, L. Gao, C. Yan, and T. Mei. Social relation recognition from videos via multi-scale spatial-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3566–3574, 2019
[27] Y. Liu, S. Li, Y. Liu, Y. Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024
[28] K. Mangalam, R. Akshulakov, and J. Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023
[29] L. Mathur, M. Qian, P. P. Liang, and L.-P. Morency. Social genome: Grounded social reasoning abilities of multimodal models. arXiv preprint arXiv:2502.15109, 2025
[30] D. W. Maynard and A. Peräkylä. Language and social interaction. In Handbook of social psychology, pages 233–257. Springer, 2003
[31] A. Mollahosseini, B. Hasani, and M. H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31, 2017
[32] M. Ning, B. Zhu, Y. Xie, B. Lin, J. Cui, L. Yuan, D. Chen, and L. Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023
[33] OpenAI. OpenAI o3 and o4-mini System Card. Technical report, OpenAI, April 2025. Accessed: 2025-05-09. The direct PDF can also be found at https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
[34] V. Patraucean, L. Smaira, A. Gupta, A. Recasens, L. Markeeva, D. Banarse, S. Koppula, M. Malinowski, Y. Yang, C. Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023
[35] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018
[36] J. Qi, J. Yu, T. Tu, K. Gao, Y. Xu, X. Guan, X. Wang, B. Xu, L. Hou, J. Li, et al. Goal: A challenging knowledge-grounded video captioning benchmark for real-time soccer commentary generation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 5391–5395, 2023
[37] R. Qiang, Y. Zhuang, Y. Li, D. S. V. K, R. Zhang, C. Li, I. S.-H. Wong, S. Yang, P. Liang, C. Zhang, and B. Dai. Mle-dojo: Interactive environments for empowering llm agents in machine learning engineering, 2025. URL https://arxiv.org/abs/2505.07782
[38] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356
[39] N. Ramnani and R. C. Miall. A system in the human brain for predicting the actions of others. Nature neuroscience, 7(1):85–90, 2004
[40] R. Rawal, K. Saifullah, M. Farré, R. Basri, D. Jacobs, G. Somepalli, and T. Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813, 2024
[41] K. Shinoda, N. Hojo, K. Nishida, S. Mizuno, K. Suzuki, R. Masumura, H. Sugiyama, and K. Saito. Tomato: Verbalizing the mental states of role-playing llms for benchmarking theory of mind. arXiv preprint arXiv:2501.08838, 2025
[42] L. Smith-Lovin and D. R. Heise. Analyzing social interaction. Advances in affect control theory. Gordon and Breach, 1988
[43] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012
[44] J. W. Strachan, D. Albergo, G. Borghini, O. Pansardi, E. Scaliti, S. Gupta, K. Saxena, A. Rufo, S. Panzeri, G. Manzi, et al. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285–1295, 2024
[45] Q. Sun, B. Schiele, and M. Fritz. A domain based approach to social relation recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3481–3490, 2017
[46] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
[47] J. W. Thibaut. The social psychology of groups. Routledge, 2017
[48] H. C. Triandis. The self and social behavior in differing cultural contexts. Psychological review, 96(3):506, 1989
[49] R. Wang, H. Yu, W. Zhang, Z. Qi, M. Sap, G. Neubig, Y. Bisk, and H. Zhu. Sotopia-pi: Interactive learning of socially intelligent language agents. arXiv preprint arXiv:2403.08715, 2024
[50] W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, X. Gu, S. Huang, B. Xu, Y. Dong, et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024
[51] A. Wilf, L. Mathur, S. Mathew, C. Ko, Y. Kebe, P. P. Liang, and L.-P. Morency. Social-iq 2.0 challenge: Benchmarking multimodal social understanding. https://github.com/abwilf/Social-IQ-2.0-Challenge, 2023
[52] H. Wu, X. Liu, C. C. Hagan, and D. Mobbs. Mentalizing during social interaction: A four component model. Cortex, 126:242–252, 2020
[53] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. Next-gpt: Any-to-any multimodal llm. In Forty-first International Conference on Machine Learning, 2024
[54] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018
[55] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024
[56] YaoFANGUK. video-subtitle-remover. https://github.com/YaoFANGUK/video-subtitle-remover, 2025. Accessed: 2025-05-09
[57] J. Ye, H. Xu, H. Liu, A. Hu, M. Yan, Q. Qian, J. Zhang, F. Huang, and J. Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024
[58] D. Yu, K. Sun, C. Cardie, and D. Yu. Dialogue-based relation extraction. arXiv preprint arXiv:2004.08056, 2020
[59]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li. Video instruction tuning with synthetic data, 2024. URL https://arxiv.org/abs/2410.02713

[60]

H. Zhao, A. Torralba, L. Torresani, and Z. Yan. Hacs: Human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8668–8678, 2019

[61]

J. Zhou, Y. Shu, B. Zhao, B. Wu, S. Xiao, X. Yang, Y. Xiong, B. Zhang, T. Huang, and Z. Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

[62]

X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L.-P. Morency, Y. Bisk, D. Fried, G. Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents. arXiv preprint arXiv:2310.11667, 2023

[64]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...


