
arxiv: 2605.15764 · v1 · pith:T6G7VVPXnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Pith reviewed 2026-05-20 18:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords social reasoning · gaze tracking · deictic gestures · multi-person video · multimodal LLMs · grounding reward · social QA dataset

The pith

GRASP dataset and Social Grounding Reward link high-level social questions to specific gaze and gesture events in multi-person videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GRASP, a dataset of 290K question-answer pairs across 46K videos that builds social reasoning tasks directly from identity-consistent gaze trajectories and deictic gestures. It pairs this resource with GRASP-Bench and proposes Social Grounding Reward (SGR) as a training signal that encourages models to identify the participants in each interaction. Experiments indicate that SGR raises accuracy on GRASP-Bench while preserving zero-shot results on existing social video QA benchmarks. A sympathetic reader cares because current multimodal models routinely misidentify who is interacting with whom when non-verbal cues are subtle and multiple people are present.

Core claim

By constructing questions from fine-grained, identity-consistent gaze trajectories and deictic gestures organized into a 16-category taxonomy, and by applying Social Grounding Reward during training, multimodal models improve their ability to ground social reasoning in the actual participants and events shown in multi-person videos.

What carries the argument

GRASP dataset built from gaze trajectories and deictic gestures, paired with the Social Grounding Reward (SGR) learning signal that reinforces participant identification in social events.

If this is right

  • Models become better at determining which people are involved in each social event within crowded scenes.
  • The 16-category taxonomy supplies structured supervision that can be reused across different video lengths and interaction types.
  • Training with SGR leaves general social video question-answering performance intact in the zero-shot regime.
  • The approach scales to 749 hours of video while remaining compatible with existing multimodal large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grounding technique could be tested on live camera feeds to support real-time social awareness in robots or meeting assistants.
  • Extending the taxonomy to additional non-verbal signals such as posture or proximity might further tighten the link between cues and social meaning.
  • If the dataset's construction method generalizes, it offers a template for building grounded reasoning resources in other domains that mix perception and high-level inference.

Load-bearing premise

The videos and questions constructed from gaze and gesture events accurately represent real-world social interactions without meaningful selection or annotation bias.

What would settle it

Training models with SGR produces no measurable gain on GRASP-Bench or causes clear drops on zero-shot social video QA benchmarks.

Figures

Figures reproduced from arXiv: 2605.15764 by Ana Jojic, Bikram Boote, Bolin Lai, Fiona Ryan, Houze Yang, James M. Rehg, Junho Kim, Sangmin Lee, Xu Cao.

Figure 1: Example from GRASP. Multi-person social reasoning requires grounding subtle non-verbal cues in the correct participants over time. Existing MLLMs [19, 78] often take spurious scene-level shortcuts, whereas ours leverage evidence-aware supervision to reason from the relevant social event. The key hypothesis underlying this work is that modern MLLMs [52, 14, 51], which integrate visual perception with strong…
Figure 2: Overview of the GRASP construction pipeline and QA examples. We convert multi-person videos into person-consistent gaze and gesture events, compose them into structured social QA pairs, and apply subset validation with human feedback for quality control. … additional online video sources, as such contents contain dense multi-person interactions with rich social signals. Our dataset comprises 46K videos, from…
Figure 3: GRASP taxonomy and statistics. Social Reasoning QA Generation. QA pairs are generated from structured event metadata derived from gaze and gesture interactions using a closed-source model [24]. Each question is constructed by querying key attributes such as participant identities, temporal intervals, and interaction types, ensuring that answers are directly verifiable without exhaustive manual annotation…
Figure 5: Grounded participant precision–accuracy on GRASP-Bench across various reasoning baselines. Marker size reflects the average number of novel participants mentioned in the reasoning trace. … often produce verbose but ungrounded reasoning traces that fail to identify the relevant individuals involved in social interactions. As shown in Tab. 3, incorporating GRPO on top of the baseline yields moderate improveme…
Figure 6: Human validation interface. Evaluators inspect each QA instance with the corresponding…
Figure 7: Dataset-level scale across the six source domains. We report the number of source videos,…
Figure 8: Distribution of retained gaze event types and gesture types. Gaze events are filtered at…
Figure 9: QA category distribution. The benchmark contains 16 categories: six gaze categories…
Figure 10: Data construction yield. The pipeline starts from source videos, detects raw gaze and…
Figure 11: Average gaze and gesture event density per video. Social Deduction Game has the highest…
Figure 12: Distribution of MCQ training examples by modality and difficulty. Open-ended examples…
Figure 13: Accuracy compared against average reasoning length for all baselines.
Figure 14: Error profile for SGR compared against difficulty of the questions for the base models. To understand the difficulty of the GRASP-Bench tasks, we group cases into buckets defined by how many base models were correct, and the accuracy of our models per bucket was computed as shown in…
Figure 15: Qualitative comparison on an easy GRASP-Bench example. VL-Rethinker-7B Qwen3VL + SGR <think>… The question asks how long Person 3’s gesture lasts, starting at 1.0 seconds… To answer, I should track the gesture type and the time span from the start until it ends, then compare that duration with the options… <gesture>From 1s to 7s, Person 3 pointed at Person 4.</gesture> The gesture begins at 1.0 seconds an…
Figure 16: Qualitative comparison on a medium GRASP-Bench example.
Figure 17: Qualitative comparison on a hard GRASP-Bench example. Qwen3VL-8B Qwen3VL + SGR <think> So, let's try to figure out this problem… The question is about identifying which player P2 is talking to in the sentence “Yeah. You have to admit that if you want to- (To <PREDICT>)”. We need to look at the gaze and gesture interactions in the video. First, let's recall the conversation context. The target utterance is…
Figure 18: Qualitative comparison on MMSI speaker target identification (STI).
Figure 19: Qualitative comparison on MMSI pronoun coreference resolution (PCR).
Figure 20: Qualitative comparison on MMSI mentioned-player prediction (MPP).
Figure 21: Qualitative comparison on TVQA+.
Figure 22: Failure cases on GRASP-Bench. We show two representative errors: an ambiguous deictic gesture where reaching and pointing cues are visually close, and a crowded gaze-reasoning case where multiple gaze events occur within the target interval.
Figure 23: Qualitative GRASP-Bench examples for gaze reasoning, covering T1–T6.
Figure 24: Qualitative GRASP-Bench examples for gesture reasoning, covering G1–G6. Question (J4, hard): Person2 points at Person1 twice in this video, around 3.5 to 42.5 s. During which pointing gesture do Person2 and Person1 make eye contact? Options: A) Only during the first pointing gesture (2.5s – 4.0s) B) Only during the second pointing gest…
Figure 25: Qualitative GRASP-Bench examples for joint gaze–gesture reasoning, covering J1–J4.
Figure 26: Prompt for deictic gesture annotation.
Figure 27: Prompt QA generation.
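
The Figure 14 caption describes grouping benchmark cases into buckets by how many base models answered correctly, then computing accuracy per bucket. A minimal sketch of that bookkeeping follows. The function name, the dictionary layout, and the toy model names are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_difficulty_bucket(base_results, model_results):
    """Accuracy of one model per difficulty bucket.

    base_results: {base_model_name: {question_id: bool}}  (hypothetical layout)
    model_results: {question_id: bool} for the model under study
    A question's bucket is the number of base models that answered it correctly.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for qid, correct in model_results.items():
        # Count how many base models got this question right.
        bucket = sum(
            1 for per_model in base_results.values() if per_model.get(qid, False)
        )
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Toy data: two hypothetical base models, three questions.
base = {
    "base_a": {"q1": True, "q2": False, "q3": True},
    "base_b": {"q1": True, "q2": False, "q3": False},
}
ours = {"q1": True, "q2": True, "q3": False}
profile = accuracy_by_difficulty_bucket(base, ours)
```

Bucket 0 holds the questions that no base model solved, so accuracy there shows most directly what a trained model adds on the hardest cases.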
Original abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GRASP, a large-scale dataset containing 290K question-answer pairs derived from 46K multi-person videos (749 hours total), organized under a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning. It proposes the Social Grounding Reward (SGR) as a learning signal that leverages identity-consistent social events to encourage models to ground interactions by identifying participants. Experiments report that SGR improves performance on the introduced GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

Significance. If the empirical results are substantiated with rigorous controls, this work would provide a valuable large-scale resource and training mechanism for advancing multimodal large language models in fine-grained social reasoning over non-verbal cues, addressing a notable gap between isolated cue detection and high-level social QA.

major comments (2)
  1. [§5] §5 (Experiments): The reported performance improvements from SGR on GRASP-Bench are presented without details on the specific baselines compared, the train/validation/test splits employed, ablation studies isolating the reward component, or statistical significance testing, which are required to establish the robustness of the central empirical claim.
  2. [§3.2] §3.2 (Dataset Construction): The process of selecting videos based on identity-consistent gaze trajectories and deictic gestures, followed by composing QA pairs under the 16-category taxonomy, risks introducing selection bias toward clear, trackable interactions; this could inflate SGR gains on GRASP-Bench without ensuring generalization to ambiguous, occluded, or culturally diverse real-world scenes.
minor comments (2)
  1. [Abstract] Abstract: Expand to name the specific related social video QA benchmarks used for the zero-shot evaluation to provide immediate context for the preservation claim.
  2. [§4] §4 (Method): Clarify the exact formulation of the SGR loss or reward function, including any hyperparameters, to allow reproduction.
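
Minor comment 2 asks for the exact SGR formulation, which this page does not give. Purely to illustrate the kind of signal the abstract describes (rewarding reasoning that names the participants of the relevant social event), a hypothetical participant-overlap reward could be sketched as below. The function names, the "Person N" mention pattern, and the choice of F1 are all assumptions; this is not the paper's reward.

```python
import re

# Assumption: participants appear in reasoning traces as "Person 3" or "Person3".
PERSON_PATTERN = re.compile(r"\bPerson\s*(\d+)\b", re.IGNORECASE)


def mentioned_participants(trace):
    """Return the set of person indices mentioned in a reasoning trace."""
    return {int(idx) for idx in PERSON_PATTERN.findall(trace)}


def grounding_reward(trace, event_participants):
    """Hypothetical reward: F1 overlap between mentioned and annotated participants."""
    predicted = mentioned_participants(trace)
    gold = set(event_participants)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)  # penalizes naming uninvolved people
    recall = overlap / len(gold)          # penalizes missing involved people
    return 2 * precision * recall / (precision + recall)


# Trace sentence taken from the Figure 15 excerpt; the event annotation is illustrative.
full_credit = grounding_reward("From 1s to 7s, Person 3 pointed at Person 4.", {3, 4})
half_credit = grounding_reward("Person 1 looks at Person 3.", {3, 4})
```

An overlap score of this kind gives no credit to a trace that names nobody and reduces credit for one that names uninvolved people. Whether the authors' SGR behaves this way can only be settled by the formulation the referee requests.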

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, outlining how we will strengthen the presentation of experiments and dataset construction.

Point-by-point responses
  1. Referee: §5 (Experiments): The reported performance improvements from SGR on GRASP-Bench are presented without details on the specific baselines compared, the train/validation/test splits employed, ablation studies isolating the reward component, or statistical significance testing, which are required to establish the robustness of the central empirical claim.

    Authors: We agree that the current experimental section would benefit from greater detail to substantiate the central claims. In the revised manuscript we will expand §5 to explicitly list the baseline models and methods, describe the train/validation/test splits used for GRASP-Bench, present ablation studies that isolate the contribution of the Social Grounding Reward, and report statistical significance testing (e.g., paired t-tests or bootstrap intervals) for the observed improvements. revision: yes

  2. Referee: §3.2 (Dataset Construction): The process of selecting videos based on identity-consistent gaze trajectories and deictic gestures, followed by composing QA pairs under the 16-category taxonomy, risks introducing selection bias toward clear, trackable interactions; this could inflate SGR gains on GRASP-Bench without ensuring generalization to ambiguous, occluded, or culturally diverse real-world scenes.

    Authors: The emphasis on identity-consistent trajectories is deliberate: it enables reliable construction of QA pairs that link fine-grained non-verbal events to specific participants, which is the core motivation for both GRASP and SGR. We acknowledge that this design choice favors clearer interactions and may affect generalization. In the revision we will add an explicit limitations paragraph in §3.2 that discusses selection bias, ambiguous/occluded cases, and cultural diversity, together with qualitative examples illustrating the dataset's coverage. revision: partial
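
The first response mentions bootstrap intervals for the reported improvements. A minimal sketch of a paired percentile bootstrap over per-question correctness is given below, assuming both models are scored on the same benchmark questions. The function name, the default of 2000 resamples, and the toy inputs are illustrative choices, not details from the paper.

```python
import random


def paired_bootstrap_ci(base_correct, sgr_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(sgr) - accuracy(base) on paired questions.

    base_correct, sgr_correct: equal-length sequences of booleans, one per question.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(base_correct) != len(sgr_correct) or len(base_correct) == 0:
        raise ValueError("inputs must be non-empty and paired question by question")
    rng = random.Random(seed)
    n = len(base_correct)
    # Resampling per-question differences keeps the pairing intact.
    diffs = [int(s) - int(b) for b, s in zip(base_correct, sgr_correct)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Degenerate toy cases with known answers.
always_better = paired_bootstrap_ci([False] * 10, [True] * 10)
no_change = paired_bootstrap_ci([True, False, True], [True, False, True])
```

An interval that excludes zero suggests the accuracy gain is unlikely to come from question sampling alone. It does not address the selection-bias concern raised in the second comment.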

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents an empirical dataset construction (GRASP with 290K QA pairs from gaze/gesture events) and a reward signal (SGR) defined directly from those events to train models for social reasoning. No equations, parameter fits, or derivations are described that reduce a claimed prediction or result to the inputs by construction. Central claims rest on experimental performance lifts on GRASP-Bench and zero-shot retention elsewhere, which are falsifiable benchmarks rather than self-referential. No load-bearing self-citations or uniqueness theorems are invoked in the provided text to justify the method. The approach is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard assumptions from multimodal learning and video annotation; SGR is introduced as a new training signal without independent external validation.

axioms (1)
  • domain assumption: Identity-consistent gaze trajectories and deictic gestures can be reliably extracted and used to generate social QA pairs.
    Invoked in the description of how questions are built from events.
invented entities (1)
  • Social Grounding Reward (SGR): no independent evidence
    purpose: Learning signal that uses social events to encourage models to reason about interaction participants
    Newly proposed in this work to train on the GRASP dataset.

pith-pipeline@v0.9.0 · 5738 in / 1248 out tokens · 32273 ms · 2026-05-20T18:45:50.370744+00:00 · methodology



Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 16 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025

  2. [2]

    System card: Claude sonnet 4.6

    Anthropic. System card: Claude sonnet 4.6. https://www.anthropic.com/claude-haiku-4-5-system-card, Feb. 2026. Official system card

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    Escnet: Gaze target detection with the understanding of 3d scenes

    Jun Bao, Buyu Liu, and Jun Yu. Escnet: Gaze target detection with the understanding of 3d scenes. In CVPR, pages 14126–14135, 2022

  6. [6]

    Tonko EW Bossen, Andreas Møgelmose, and Ross Greer. Can vision-language models understand and interpret dynamic gestures from pedestrians? pilot datasets and exploration towards instructive nonverbal commands for cooperative autonomous vehicles. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4779–4788, 2025

  7. [7]

    Socialgesture: Delving into multi-person gesture understanding

    Xu Cao, Pranav Virupaksha, Wenqi Jia, Bolin Lai, Fiona Ryan, Sangmin Lee, and James M Rehg. Socialgesture: Delving into multi-person gesture understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19509–19519, 2025

  8. [8]

    Toward human deictic gesture target estimation

    Xu Cao, Pranav Virupaksha, Sangmin Lee, Bolin Lai, Wenqi Jia, Jintai Chen, and James Matthew Rehg. Toward human deictic gesture target estimation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  9. [9]

    Gaze target estimation anywhere with concepts

    Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, and James M Rehg. Gaze target estimation anywhere with concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  10. [10]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  11. [11]

    Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens

    Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng. Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens. arXiv preprint arXiv:2602.13517, 2026

  12. [12]

    Scaling rl to long videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025

  13. [13]

    Detecting attended visual targets in video

    Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M Rehg. Detecting attended visual targets in video. In CVPR, pages 5396–5406, 2020

  14. [14]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems, 2023

  15. [15]

    Retinaface: Single-shot multi-level face localisation in the wild

    Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020

  16. [16]

    Inferring shared attention in social scene videos

    Lifeng Fan, Yixin Chen, Ping Wei, Wenguan Wang, and Song-Chun Zhu. Inferring shared attention in social scene videos. In CVPR, pages 6460–6468, 2018

  17. [17]

    Understanding human gaze communication by spatio-temporal graph reasoning

    Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5724–5733, 2019

  18. [18]

    Dual attention guided gaze target detection in the wild

    Yi Fang, Jiapeng Tang, Wang Shen, Wei Shen, Xiao Gu, Li Song, and Guangtao Zhai. Dual attention guided gaze target detection in the wild. In CVPR, pages 11390–11399, 2021

  19. [19]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  20. [20]

    Mechanisms of social cognition

    Chris D Frith and Uta Frith. Mechanisms of social cognition. Annual review of psychology, 63:287–313, 2012

  21. [21]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  22. [22]

    Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding. arXiv preprint arXiv:2604.05015, 2026

  23. [23]

    Reasoning strategies explain individual differences in social reasoning

    Émilie Gagnon-St-Pierre, Marina M Doucerain, and Henry Markovits. Reasoning strategies explain individual differences in social reasoning. Journal of Experimental Psychology: General, 150(2):340, 2021

  24. [24]

    Gemini 3.1 pro model card

    Google Deepmind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, Feb. 2026. Official system card

  25. [25]

    Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction

    Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, and Jean-marc Odobez. Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction. Advances in Neural Information Processing Systems, 37:15646–15673, 2024

  26. [26]

    A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings

    Anshul Gupta, Samy Tafasca, and Jean-Marc Odobez. A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings. In CVPRW, pages 5041–5050, 2022

  27. [27]

    Exploring the zero-shot capabilities of vision-language models for improving gaze following

    Anshul Gupta, Pierre Vuillecard, Arya Farkhondeh, and Jean-Marc Odobez. Exploring the zero-shot capabilities of vision-language models for improving gaze following. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 615–624, 2024

  28. [28]

    Nonverbal communication

    Judith A Hall, Terrence G Horgan, and Nora A Murphy. Nonverbal communication. Annual review of psychology, 70(2019):271–294, 2019

  29. [29]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  30. [30]

    Gazevqa: A video question answering dataset for multiview eye-gaze task-oriented collaborations

    Muhammet Ilaslan, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Lim, and Mike Shou. Gazevqa: A video question answering dataset for multiview eye-gaze task-oriented collaborations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10462–10479, 2023

  31. [31]

    Depth-aware gaze-following via auxiliary networks for robotics

    Tianlei Jin, Qizhi Yu, Shiqiang Zhu, Zheyuan Lin, Jie Ren, Yuanhai Zhou, and Wei Song. Depth-aware gaze-following via auxiliary networks for robotics. Engineering Applications of Artificial Intelligence, 113:104924, 2022

  32. [32]

    The “social gaze space”: A taxonomy for gaze-based communication in triadic interactions

    Mathis Jording, Arne Hartz, Gary Bente, Martin Schulte-Rüther, and Kai Vogeley. The “social gaze space”: A taxonomy for gaze-based communication in triadic interactions. Frontiers in psychology, 9:226, 2018

  33. [33]

    Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions

    Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, and Yoichi Sato. Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions. arXiv preprint arXiv:2511.16221, 2025

  34. [34]

    Hagrid–hand gesture recognition image dataset

    Alexander Kapitanov, Karina Kvanchiani, Alexander Nagaev, Roman Kraynov, and Andrei Makhliarchuk. Hagrid–hand gesture recognition image dataset. In WACV, pages 4572–4581, 2024

  35. [35]

    Kobin H Kendrick, Judith Holler, and Stephen C Levinson. Turn-taking in human face-to-face interaction is multimodal: gaze direction and manual gestures aid the coordination of turn transitions. Philosophical transactions of the royal society B, 378(1875):20210473, 2023

  36. [36]

    Salova: Segment-augmented long video assistant for targeted retrieval and routing in long-form video analysis

    Junho Kim, Hyunjun Kim, Hosu Lee, and Yong Man Ro. Salova: Segment-augmented long video assistant for targeted retrieval and routing in long-form video analysis. arXiv preprint arXiv:2411.16173, 2024

  37. [37]

    SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

    Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, and Xue Feng. Siv-bench: A video benchmark for social interaction understanding and reasoning. arXiv preprint arXiv:2506.05425, 2025

  38. [38]

    The hungarian method for the assignment problem

    Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955

  39. [39]

    Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games

    Bolin Lai, Hongxin Zhang, Miao Liu, Aryan Pariani, Fiona Ryan, Wenqi Jia, Shirley Anugrah Hayati, James Rehg, and Diyi Yang. Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games. In Findings of ACL, pages 6570–6588, 2023

  40. [40]

    Modeling multimodal social interactions: new challenges and baselines with densely aligned representations

    Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, and James M Rehg. Modeling multimodal social interactions: new challenges and baselines with densely aligned representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14585–14595, 2024

  41. [41]

    Towards social ai: A survey on understanding social interactions

    Sangmin Lee, Minzhi Li, Bolin Lai, Wenqi Jia, Fiona Ryan, Xu Cao, Ozgur Kara, Bikram Boote, Weiyan Shi, Diyi Yang, et al. Towards social ai: A survey on understanding social interactions. arXiv preprint arXiv:2409.15316, 2024

  42. [42]

    Tvqa: Localized, compositional video question answering

    Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. In EMNLP, 2018

  43. [43]

    Tvqa+: Spatio-temporal grounding for video question answering

    Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020

  44. [44]

    Mimeqa: Towards socially-intelligent nonverbal foundation models

    Hengzhi Li, Megan Tjandrasuwita, Yi R Fung, Armando Solar-Lezama, and Paul Pu Liang. Mimeqa: Towards socially-intelligent nonverbal foundation models. arXiv preprint arXiv:2502.16671, 2025

  45. [45]

    Towards online multi-modal social interaction understanding

    Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M Rehg, and Yapeng Tian. Towards online multi-modal social interaction understanding. arXiv preprint arXiv:2503.19851, 2025

  46. [46]

    Omni-mmsi: Toward identity-attributed social interaction understanding

    Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, and Yapeng Tian. Omni-mmsi: Toward identity-attributed social interaction understanding. arXiv preprint arXiv:2604.00267, 2026

  47. [47]

    In the eye of beholder: Joint learning of gaze and actions in first person video

    Yin Li, Miao Liu, and James M Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV, pages 619–635, 2018

  48. [48]

    Zhuoming Li, Aitong Liu, Mengxi Jia, Yubo Lu, Tengxiang Zhang, Changzhi Sun, Dell Zhang, and Xuelong Li. Gestura: A lvlm-powered system bridging motion and semantics for real-time free-form gesture understanding. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(4):1–29, 2025

  49. [49]

    V-alphasocial: Benchmark and self-reflective chain-of-thought generation for visual social commonsense reasoning

    Zongyu Lin, Zhikun Xu, Xiaohan Song, Yixin Wan, Xingcheng Yao, Tsung-Han Lin, Selina Song, Pranav Subbaraman, Ben Zhou, Kai-Wei Chang, et al. V-alphasocial: Benchmark and self-reflective chain-of-thought generation for visual social commonsense reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19025–19047, 2025

  50. [50]

    Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recognition

    Dan Liu, Libo Zhang, and Yanjun Wu. Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recognition. In CVPR, pages 3304–3312, 2022

  51. [51]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  52. [52]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023

  53. [53]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  54. [54]

    "Here’s looking at you, kid": Detecting people looking at each other in videos

    Manuel Marin-Jimenez, Andrew Zisserman, and Vittorio Ferrari. "Here’s looking at you, kid": Detecting people looking at each other in videos. In BMVC. British Machine Vision Association and Society for Pattern Recognition, 2011

  55. [55]

    Gazevlm: A vision-language model for multi-task gaze understanding

    Athul M Mathew, Haithem Hermassi, Thariq Khalid, and Arshad Ali Khan. Gazevlm: A vision-language model for multi-task gaze understanding. arXiv preprint arXiv:2511.06348, 2025

  56. [56]

    Social genome: Grounded social reasoning abilities of multimodal models

    Leena Mathur, Marian Qian, Paul Pu Liang, and Louis-Philippe Morency. Social genome: Grounded social reasoning abilities of multimodal models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24879–24902, 2025

  57. [57]

    Embody 3d: A large-scale multimodal motion and behavior dataset

    Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset. arXiv preprint arXiv:2510.16258, 2025

  58. [58]

    Hand and mind: What gestures reveal about thought

    David McNeill. Hand and mind: What gestures reveal about thought. University of Chicago press, 1992

  59. [59]

    Joint attention: Its origins and role in development

    Chris Moore, Philip J Dunham, and Phil Dunham. Joint attention: Its origins and role in development. Psychology Press, 2014

  60. [60]

    See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, et al. See, hear, and understand: Benchmarking audiovisual human speech understanding in multimodal large language models. arXiv preprint arXiv:2512.02231, 2025

  61. [61]

    Read the room: Video social reasoning with mental-physical causal chains

    Lixing Niu, Jiapeng Li, Xingping Yu, Xinyi Dong, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. Read the room: Video social reasoning with mental-physical causal chains. In The Fourteenth International Conference on Learning Representations, 2026

  62. [62]

    R^3-vqa: "read the room" by video social reasoning

    Lixing Niu, Jiapeng Li, Xingping Yu, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. R^3-vqa: "read the room" by video social reasoning. arXiv preprint arXiv:2505.04147, 2025

  63. [63]

    Gpt-5.4 thinking system card

    OpenAI. Gpt-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, Mar 2026. Official system card

  64. [64]

    Multi-speaker attention alignment for multimodal social interaction

    Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, and Yoichi Sato. Multi-speaker attention alignment for multimodal social interaction. arXiv preprint arXiv:2511.17952, 2025

  65. [65]

    Gaze-vlm: Bridging gaze and vlms through attention regularization for egocentric understanding

    Anupam Pani and Yanchao Yang. Gaze-vlm: Bridging gaze and vlms through attention regularization for egocentric understanding. arXiv preprint arXiv:2510.21356, 2025

  66. [66]

    Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes

    Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes. arXiv preprint arXiv:2505.23179, 2025

  67. [67]

    In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting

    Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025

  68. [68]

    Where are they looking?

    Adria Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. Where are they looking? NeurIPS, 28, 2015

  69. [69]

    Gaze-lle: Gaze target estimation via large-scale learned encoders

    Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M Rehg. Gaze-lle: Gaze target estimation via large-scale learned encoders. 2025

  70. [70]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  71. [71]

    From eliza to xiaoice: challenges and opportunities with social chatbots

    Heung-Yeung Shum, Xiao-dong He, and Di Li. From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19:10–26, 2018

  72. [72]

    Vitgaze: gaze following with interaction features in vision transformers

    Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, and Xiangmin Xu. Vitgaze: gaze following with interaction features in vision transformers. Visual Intelligence, 2(1):1–15, 2024

  73. [73]

    Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms

    Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127, 2025

  74. [74]

    Socialfusion: Addressing social degradation in pre-trained vision-language models

    Hamza Tahboub, Weiyan Shi, Gang Hua, and Huaizu Jiang. Socialfusion: Addressing social degradation in pre-trained vision-language models. arXiv preprint arXiv:2512.01148, 2025

  75. [75]

    Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

  76. [76]

    Social caption: Evaluating social understanding in multimodal models

    Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, and Louis-Philippe Morency. Social caption: Evaluating social understanding in multimodal models. arXiv preprint arXiv:2601.14569, 2026

  77. [77]

    Joint attention and early language

    Michael Tomasello and Michael Jeffrey Farrar. Joint attention and early language. Child development, pages 1454–1463, 1986

  78. [78]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

  79. [79]

    Gaze following in question answering: A comprehensive benchmark for vision-language models

    Shijing Wang, Chaoqun Cui, Yihua Cheng, and Yaping Huang. Gaze following in question answering: A comprehensive benchmark for vision-language models, 2025

  80. [80]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

Showing first 80 references.