Social Caption: Evaluating Social Understanding in Multimodal Models

Bhaavanaa Thumu; Leena Mathur; Louis-Philippe Morency; Youssouf Kebe

arxiv: 2601.14569 · v2 · pith:XSR2IGQPnew · submitted 2026-01-21 · 💻 cs.CL · cs.LG

classification 💻 cs.CL cs.LG

keywords social understanding interactions ability multimodal abilities analysis caption

Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce SOCIAL CAPTION, a framework grounded in interaction theory that evaluates the social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; and Directed Social Analysis (DSA), the ability to generate relevant information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges demonstrate a path towards scaling automated evaluation of multimodal social understanding.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
cs.CV 2026-05 unverdicted novelty 7.0

GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
cs.CV 2026-05 unverdicted novelty 5.0

VLMs are evaluated on gaze following and social gaze prediction using existing datasets in zero-shot and fine-tuned settings, revealing they currently lack precise capabilities compared to visual models.

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
cs.CV 2026-05 unverdicted novelty 5.0

EyeVLM benchmark finds that current VLMs underperform specialized visual models on gaze following and social gaze prediction, with fine-tuning narrowing but not closing the gap.