GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
Pith reviewed 2026-05-20 18:45 UTC · model grok-4.3
The pith
GRASP dataset and Social Grounding Reward link high-level social questions to specific gaze and gesture events in multi-person videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing questions from fine-grained, identity-consistent gaze trajectories and deictic gestures organized into a 16-category taxonomy, and by applying Social Grounding Reward during training, multimodal models improve their ability to ground social reasoning in the actual participants and events shown in multi-person videos.
What carries the argument
GRASP dataset built from gaze trajectories and deictic gestures, paired with the Social Grounding Reward (SGR) learning signal that reinforces participant identification in social events.
If this is right
- Models become better at determining which people are involved in each social event within crowded scenes.
- The 16-category taxonomy supplies structured supervision that can be reused across different video lengths and interaction types.
- Training with SGR leaves general social video question-answering performance intact in the zero-shot regime.
- The approach scales to 749 hours of video while remaining compatible with existing multimodal large language models.
Where Pith is reading between the lines
- The same grounding technique could be tested on live camera feeds to support real-time social awareness in robots or meeting assistants.
- Extending the taxonomy to additional non-verbal signals such as posture or proximity might further tighten the link between cues and social meaning.
- If the dataset's construction method generalizes, it offers a template for building grounded reasoning resources in other domains that mix perception and high-level inference.
Load-bearing premise
The videos and questions constructed from gaze and gesture events accurately represent real-world social interactions without meaningful selection or annotation bias.
What would settle it
Training models with SGR produces no measurable gain on GRASP-Bench or causes clear drops on zero-shot social video QA benchmarks.
Figures
read the original abstract
Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question--answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze--gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GRASP, a large-scale dataset containing 290K question-answer pairs derived from 46K multi-person videos (749 hours total), organized under a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning. It proposes the Social Grounding Reward (SGR) as a learning signal that leverages identity-consistent social events to encourage models to ground interactions by identifying participants. Experiments report that SGR improves performance on the introduced GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
Significance. If the empirical results are substantiated with rigorous controls, this work would provide a valuable large-scale resource and training mechanism for advancing multimodal large language models in fine-grained social reasoning over non-verbal cues, addressing a notable gap between isolated cue detection and high-level social QA.
major comments (2)
- [§5] §5 (Experiments): The reported performance improvements from SGR on GRASP-Bench are presented without details on the specific baselines compared, the train/validation/test splits employed, ablation studies isolating the reward component, or statistical significance testing, which are required to establish the robustness of the central empirical claim.
- [§3.2] §3.2 (Dataset Construction): The process of selecting videos based on identity-consistent gaze trajectories and deictic gestures, followed by composing QA pairs under the 16-category taxonomy, risks introducing selection bias toward clear, trackable interactions; this could inflate SGR gains on GRASP-Bench without ensuring generalization to ambiguous, occluded, or culturally diverse real-world scenes.
minor comments (2)
- [Abstract] Abstract: Expand to name the specific related social video QA benchmarks used for the zero-shot evaluation to provide immediate context for the preservation claim.
- [§4] §4 (Method): Clarify the exact formulation of the SGR loss or reward function, including any hyperparameters, to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, outlining how we will strengthen the presentation of experiments and dataset construction.
read point-by-point responses
-
Referee: §5 (Experiments): The reported performance improvements from SGR on GRASP-Bench are presented without details on the specific baselines compared, the train/validation/test splits employed, ablation studies isolating the reward component, or statistical significance testing, which are required to establish the robustness of the central empirical claim.
Authors: We agree that the current experimental section would benefit from greater detail to substantiate the central claims. In the revised manuscript we will expand §5 to explicitly list the baseline models and methods, describe the train/validation/test splits used for GRASP-Bench, present ablation studies that isolate the contribution of the Social Grounding Reward, and report statistical significance testing (e.g., paired t-tests or bootstrap intervals) for the observed improvements. revision: yes
-
Referee: §3.2 (Dataset Construction): The process of selecting videos based on identity-consistent gaze trajectories and deictic gestures, followed by composing QA pairs under the 16-category taxonomy, risks introducing selection bias toward clear, trackable interactions; this could inflate SGR gains on GRASP-Bench without ensuring generalization to ambiguous, occluded, or culturally diverse real-world scenes.
Authors: The emphasis on identity-consistent trajectories is deliberate: it enables reliable construction of QA pairs that link fine-grained non-verbal events to specific participants, which is the core motivation for both GRASP and SGR. We acknowledge that this design choice favors clearer interactions and may affect generalization. In the revision we will add an explicit limitations paragraph in §3.2 that discusses selection bias, ambiguous/occluded cases, and cultural diversity, together with qualitative examples illustrating the dataset's coverage. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical dataset construction (GRASP with 290K QA pairs from gaze/gesture events) and a reward signal (SGR) defined directly from those events to train models for social reasoning. No equations, parameter fits, or derivations are described that reduce a claimed prediction or result to the inputs by construction. Central claims rest on experimental performance lifts on GRASP-Bench and zero-shot retention elsewhere, which are falsifiable benchmarks rather than self-referential. No load-bearing self-citations or uniqueness theorems are invoked in the provided text to justify the method. The approach is self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Identity-consistent gaze trajectories and deictic gestures can be reliably extracted and used to generate social QA pairs
invented entities (1)
-
Social Grounding Reward (SGR)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events... Social Grounding Reward (SGR) ... verifies whether the model’s reasoning references the correct participants
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
System card: Claude sonnet 4.6
Anthropic. System card: Claude sonnet 4.6. https://www.anthropic.com/ claude-haiku-4-5-system-card, feb 2026. Official system card
work page 2026
-
[3]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Escnet: Gaze target detection with the understanding of 3d scenes
Jun Bao, Buyu Liu, and Jun Yu. Escnet: Gaze target detection with the understanding of 3d scenes. In CVPR, pages 14126–14135, 2022
work page 2022
-
[6]
Tonko EW Bossen, Andreas Møgelmose, and Ross Greer. Can vision-language models understand and interpret dynamic gestures from pedestrians? pilot datasets and exploration towards instructive nonverbal commands for cooperative autonomous vehicles. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4779–4788, 2025
work page 2025
-
[7]
Socialgesture: Delving into multi-person gesture understanding
Xu Cao, Pranav Virupaksha, Wenqi Jia, Bolin Lai, Fiona Ryan, Sangmin Lee, and James M Rehg. Socialgesture: Delving into multi-person gesture understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19509–19519, 2025
work page 2025
-
[8]
Toward human deictic gesture target estimation
Xu Cao, Pranav Virupaksha, Sangmin Lee, Bolin Lai, Wenqi Jia, Jintai Chen, and James Matthew Rehg. Toward human deictic gesture target estimation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[9]
Gaze target estimation anywhere with concepts
Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, and James M Rehg. Gaze target estimation anywhere with concepts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026
work page 2026
-
[10]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng. Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens.arXiv preprint arXiv:2602.13517, 2026
-
[12]
Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025
-
[13]
Detecting attended visual targets in video
Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M Rehg. Detecting attended visual targets in video. InCVPR, pages 5396–5406, 2020
work page 2020
-
[14]
InstructBLIP: Towards general-purpose vision-language models with instruction tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. InAdvances in Neural Information Processing Systems, 2023
work page 2023
-
[15]
Retinaface: Single-shot multi-level face localisation in the wild
Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020
work page 2020
-
[16]
Inferring shared attention in social scene videos
Lifeng Fan, Yixin Chen, Ping Wei, Wenguan Wang, and Song-Chun Zhu. Inferring shared attention in social scene videos. InCVPR, pages 6460–6468, 2018
work page 2018
-
[17]
Understanding human gaze communication by spatio-temporal graph reasoning
Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5724–5733, 2019
work page 2019
-
[18]
Dual attention guided gaze target detection in the wild
Yi Fang, Jiapeng Tang, Wang Shen, Wei Shen, Xiao Gu, Li Song, and Guangtao Zhai. Dual attention guided gaze target detection in the wild. InCVPR, pages 11390–11399, 2021
work page 2021
-
[19]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 10
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Mechanisms of social cognition.Annual review of psychology, 63:287–313, 2012
Chris D Frith and Uta Frith. Mechanisms of social cognition.Annual review of psychology, 63:287–313, 2012
work page 2012
-
[21]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025
work page 2025
-
[22]
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
Émilie Gagnon-St-Pierre, Marina M Doucerain, and Henry Markovits. Reasoning strategies explain individual differences in social reasoning.Journal of Experimental Psychology: General, 150(2):340, 2021
work page 2021
-
[24]
Google Deepmind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, feb 2026. Official system card
work page 2026
-
[25]
Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, and Jean-marc Odobez. Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction.Advances in Neural Information Processing Systems, 37:15646–15673, 2024
work page 2024
-
[26]
Anshul Gupta, Samy Tafasca, and Jean-Marc Odobez. A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings. InCVPRW, pages 5041–5050, 2022
work page 2022
-
[27]
Exploring the zero-shot capabilities of vision-language models for improving gaze following
Anshul Gupta, Pierre Vuillecard, Arya Farkhondeh, and Jean-Marc Odobez. Exploring the zero-shot capabilities of vision-language models for improving gaze following. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 615–624, 2024
work page 2024
-
[28]
Nonverbal communication.Annual review of psychology, 70(2019):271–294, 2019
Judith A Hall, Terrence G Horgan, and Nora A Murphy. Nonverbal communication.Annual review of psychology, 70(2019):271–294, 2019
work page 2019
-
[29]
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Gazevqa: A video question answering dataset for multiview eye-gaze task-oriented collaborations
Muhammet Ilaslan, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Lim, and Mike Shou. Gazevqa: A video question answering dataset for multiview eye-gaze task-oriented collaborations. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10462–10479, 2023
work page 2023
-
[31]
Tianlei Jin, Qizhi Yu, Shiqiang Zhu, Zheyuan Lin, Jie Ren, Yuanhai Zhou, and Wei Song. Depth-aware gaze-following via auxiliary networks for robotics.Engineering Applications of Artificial Intelligence, 113:104924, 2022
work page 2022
-
[32]
Mathis Jording, Arne Hartz, Gary Bente, Martin Schulte-Rüther, and Kai V ogeley. The “social gaze space”: A taxonomy for gaze-based communication in triadic interactions.Frontiers in psychology, 9:226, 2018
work page 2018
-
[33]
Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, and Yoichi Sato. Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions. arXiv preprint arXiv:2511.16221, 2025
-
[34]
Hagrid–hand gesture recognition image dataset
Alexander Kapitanov, Karina Kvanchiani, Alexander Nagaev, Roman Kraynov, and Andrei Makhliarchuk. Hagrid–hand gesture recognition image dataset. InWACV, pages 4572–4581, 2024
work page 2024
-
[35]
Kobin H Kendrick, Judith Holler, and Stephen C Levinson. Turn-taking in human face-to-face interaction is multimodal: gaze direction and manual gestures aid the coordination of turn transitions.Philosophical transactions of the royal society B, 378(1875):20210473, 2023
work page 2023
-
[36]
Junho Kim, Hyunjun Kim, Hosu Lee, and Yong Man Ro. Salova: Segment-augmented long video assistant for targeted retrieval and routing in long-form video analysis.arXiv preprint arXiv:2411.16173, 2024
-
[37]
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, and Xue Feng. Siv-bench: A video benchmark for social interaction understanding and reasoning.arXiv preprint arXiv:2506.05425, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Harold W Kuhn. The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97, 1955
work page 1955
-
[39]
Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games
Bolin Lai, Hongxin Zhang, Miao Liu, Aryan Pariani, Fiona Ryan, Wenqi Jia, Shirley Anugrah Hayati, James Rehg, and Diyi Yang. Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games. InFindings of ACL, pages 6570–6588, 2023
work page 2023
-
[40]
Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, and James M Rehg. Modeling multimodal social interactions: new challenges and baselines with densely aligned representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14585–14595, 2024
work page 2024
-
[41]
Sangmin Lee, Minzhi Li, Bolin Lai, Wenqi Jia, Fiona Ryan, Xu Cao, Ozgur Kara, Bikram Boote, Weiyan Shi, Diyi Yang, et al. Towards social ai: A survey on understanding social interactions.arXiv preprint arXiv:2409.15316, 2024
-
[42]
Tvqa: Localized, compositional video question answering
Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. InEMNLP, 2018
work page 2018
-
[43]
Tvqa+: Spatio-temporal grounding for video question answering
Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020
work page 2020
-
[44]
Hengzhi Li, Megan Tjandrasuwita, Yi R Fung, Armando Solar-Lezama, and Paul Pu Liang. Mimeqa: Towards socially-intelligent nonverbal foundation models.arXiv preprint arXiv:2502.16671, 2025. 11
-
[45]
Towards online multi-modal social interaction understanding.arXiv preprint arXiv:2503.19851, 2025
Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M Rehg, and Yapeng Tian. Towards online multi-modal social interaction understanding.arXiv preprint arXiv:2503.19851, 2025
-
[46]
Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, and Yapeng Tian. Omni-mmsi: Toward identity-attributed social interaction understanding.arXiv preprint arXiv:2604.00267, 2026
-
[47]
In the eye of beholder: Joint learning of gaze and actions in first person video
Yin Li, Miao Liu, and James M Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. InECCV, pages 619–635, 2018
work page 2018
-
[48]
Zhuoming Li, Aitong Liu, Mengxi Jia, Yubo Lu, Tengxiang Zhang, Changzhi Sun, Dell Zhang, and Xuelong Li. Gestura: A lvlm-powered system bridging motion and semantics for real-time free-form gesture understanding.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(4):1–29, 2025
work page 2025
-
[49]
Zongyu Lin, Zhikun Xu, Xiaohan Song, Yixin Wan, Xingcheng Yao, Tsung-Han Lin, Selina Song, Pranav Subbaraman, Ben Zhou, Kai-Wei Chang, et al. V-alphasocial: Benchmark and self-reflective chain-of-thought generation for visual social commonsense reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19025–19047, 2025
work page 2025
-
[50]
Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recognition
Dan Liu, Libo Zhang, and Yanjun Wu. Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recognition. InCVPR, pages 3304–3312, 2022
work page 2022
-
[51]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023
work page 2023
-
[53]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Manuel Marin-Jimenez, Andrew Zisserman, and Vittorio Ferrari. " here’s looking at you, kid": Detecting people looking at each other in videos. InBMVC. British Machine Vision Association and Society for Pattern Recognition, 2011
work page 2011
-
[55]
Athul M Mathew, Haithem Hermassi, Thariq Khalid, and Arshad Ali Khan. Gazevlm: A vision-language model for multi-task gaze understanding.arXiv preprint arXiv:2511.06348, 2025
-
[56]
Social genome: Grounded social reasoning abilities of multimodal models
Leena Mathur, Marian Qian, Paul Pu Liang, and Louis-Philippe Morency. Social genome: Grounded social reasoning abilities of multimodal models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24879–24902, 2025
work page 2025
-
[57]
arXiv preprint arXiv:2510.16258 , year=
Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025
-
[58]
University of Chicago press, 1992
David McNeill.Hand and mind: What gestures reveal about thought. University of Chicago press, 1992
work page 1992
-
[59]
Chris Moore, Philip J Dunham, and Phil Dunham.Joint attention: Its origins and role in development. Psychology Press, 2014
work page 2014
-
[60]
Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, et al. See, hear, and understand: Bench- marking audiovisual human speech understanding in multimodal large language models.arXiv preprint arXiv:2512.02231, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[61]
Read the room: Video social reasoning with mental-physical causal chains
Lixing Niu, Jiapeng Li, Xingping Yu, Xinyi Dong, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. Read the room: Video social reasoning with mental-physical causal chains. InThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[62]
Lixing Niu, Jiapeng Li, Xingping Yu, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. Rˆ 3-vqa:" read the room" by video social reasoning.arXiv preprint arXiv:2505.04147, 2025
-
[63]
OpenAI. Gpt-5.4 thinking system card. https://openai.com/index/ gpt-5-4-thinking-system-card/, mar 2026. Official system card
work page 2026
-
[64]
Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, and Yoichi Sato. Multi-speaker attention alignment for multimodal social interaction.arXiv preprint arXiv:2511.17952, 2025
-
[65]
Anupam Pani and Yanchao Yang. Gaze-vlm: Bridging gaze and vlms through attention regularization for egocentric understanding.arXiv preprint arXiv:2510.21356, 2025
-
[66]
Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes.arXiv preprint arXiv:2505.23179, 2025
-
[67]
Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting.arXiv preprint arXiv:2509.07447, 2025
-
[68]
Where are they looking?NeurIPS, 28, 2015
Adria Recasens, Aditya Khosla, Carl V ondrick, and Antonio Torralba. Where are they looking?NeurIPS, 28, 2015
work page 2015
-
[69]
Gaze-lle: Gaze target estimation via large-scale learned encoders
Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M Rehg. Gaze-lle: Gaze target estimation via large-scale learned encoders. 2025
work page 2025
-
[70]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[71]
Heung-Yeung Shum, Xiao-dong He, and Di Li. From eliza to xiaoice: challenges and opportunities with social chatbots.Frontiers of Information Technology & Electronic Engineering, 19:10–26, 2018. 12
work page 2018
-
[72]
Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, and Xiangmin Xu. Vitgaze: gaze following with interaction features in vision transformers.Visual Intelligence, 2(1):1–15, 2024
work page 2024
-
[73]
Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms.arXiv preprint arXiv:2505.00127, 2025
-
[74]
Hamza Tahboub, Weiyan Shi, Gang Hua, and Huaizu Jiang. Socialfusion: Addressing social degradation in pre-trained vision-language models.arXiv preprint arXiv:2512.01148, 2025
-
[75]
Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[76]
Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, and Louis-Philippe Morency. Social caption: Evaluating social understanding in multimodal models.arXiv preprint arXiv:2601.14569, 2026
-
[77]
Joint attention and early language.Child development, pages 1454–1463, 1986
Michael Tomasello and Michael Jeffrey Farrar. Joint attention and early language.Child development, pages 1454–1463, 1986
work page 1986
-
[78]
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[79]
Gaze following in question answering: A comprehensive benchmark for vision-language models, 2025
Shijing Wang, Chaoqun Cui, Yihua Cheng, and Yaping Huang. Gaze following in question answering: A comprehensive benchmark for vision-language models, 2025
work page 2025
-
[80]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.