Social Caption: Evaluating Social Understanding in Multimodal Models
Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce SOCIAL CAPTION, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; and Directed Social Analysis (DSA), the ability to generate relevant information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges demonstrate a path towards scaling automated evaluation of multimodal social understanding.
Forward citations
Cited by 3 Pith papers
- GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
  GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.
- Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
  VLMs are evaluated on gaze following and social gaze prediction using existing datasets in zero-shot and fine-tuned settings, revealing they currently lack precise capabilities compared to visual models.
- Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
  EyeVLM benchmark finds that current VLMs underperform specialized visual models on gaze following and social gaze prediction, with fine-tuning narrowing but not closing the gap.