GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Ana Jojic; Bikram Boote; Bolin Lai; Fiona Ryan; Houze Yang; James M. Rehg; Junho Kim; Sangmin Lee; Xu Cao

arxiv: 2605.15764 · v1 · pith:T6G7VVPXnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

Pith reviewed 2026-05-20 18:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: social reasoning · gaze tracking · deictic gestures · multi-person video · multimodal LLMs · grounding reward · social QA dataset

The pith

GRASP dataset and Social Grounding Reward link high-level social questions to specific gaze and gesture events in multi-person videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GRASP, a dataset of 290K question-answer pairs across 46K videos that builds social reasoning tasks directly from identity-consistent gaze trajectories and deictic gestures. It pairs this resource with GRASP-Bench and proposes Social Grounding Reward (SGR) as a training signal that encourages models to identify the participants in each interaction. Experiments indicate that SGR raises accuracy on GRASP-Bench while preserving zero-shot results on existing social video QA benchmarks. A sympathetic reader cares because current multimodal models routinely misidentify who is interacting with whom when non-verbal cues are subtle and multiple people are present.

Core claim

By constructing questions from fine-grained, identity-consistent gaze trajectories and deictic gestures organized into a 16-category taxonomy, and by applying Social Grounding Reward during training, multimodal models improve their ability to ground social reasoning in the actual participants and events shown in multi-person videos.

What carries the argument

GRASP dataset built from gaze trajectories and deictic gestures, paired with the Social Grounding Reward (SGR) learning signal that reinforces participant identification in social events.
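A participant-grounding reward of this general kind can be pictured with a small sketch. This page does not give the paper's actual SGR formulation, so everything below is an assumption made for illustration: the function names, the "Person N" / "PN" mention pattern, the F1 overlap score, and the `grounding_weight` value are all hypothetical.

```python
import re

# Hypothetical convention: participants appear in the trace as "Person 3" or "P3".
PERSON_PATTERN = re.compile(r"\b(?:Person|P)\s?(\d+)\b")


def mentioned_participants(reasoning_trace: str) -> set[int]:
    """Collect the participant ids that a reasoning trace refers to."""
    return {int(pid) for pid in PERSON_PATTERN.findall(reasoning_trace)}


def grounding_reward(
    reasoning_trace: str,
    event_participants: set[int],
    answer_correct: bool,
    grounding_weight: float = 0.5,  # assumed value, not taken from the paper
) -> float:
    """Answer reward plus a bonus for naming the right participants.

    The bonus is the F1 overlap between the participants mentioned in the
    trace and the participants of the annotated social event.
    """
    mentioned = mentioned_participants(reasoning_trace)
    hits = len(mentioned & event_participants)
    if hits == 0:
        f1 = 0.0
    else:
        precision = hits / len(mentioned)
        recall = hits / len(event_participants)
        f1 = 2 * precision * recall / (precision + recall)
    return float(answer_correct) + grounding_weight * f1


trace = "From 1s to 7s, Person 3 pointed at Person 4."
print(grounding_reward(trace, {3, 4}, answer_correct=True))
```

Using F1 rather than recall alone is a design choice of this sketch: it penalizes traces that list every visible person in the hope of covering the right ones.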

If this is right

Models become better at determining which people are involved in each social event within crowded scenes.
The 16-category taxonomy supplies structured supervision that can be reused across different video lengths and interaction types.
Training with SGR leaves general social video question-answering performance intact in the zero-shot regime.
The approach scales to 749 hours of video while remaining compatible with existing multimodal large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grounding technique could be tested on live camera feeds to support real-time social awareness in robots or meeting assistants.
Extending the taxonomy to additional non-verbal signals such as posture or proximity might further tighten the link between cues and social meaning.
If the dataset's construction method generalizes, it offers a template for building grounded reasoning resources in other domains that mix perception and high-level inference.

Load-bearing premise

The videos and questions constructed from gaze and gesture events accurately represent real-world social interactions without meaningful selection or annotation bias.

What would settle it

The claim would be undermined if training models with SGR produced no measurable gain on GRASP-Bench or caused clear drops on zero-shot social video QA benchmarks.

Figures

Figures reproduced from arXiv: 2605.15764 by Ana Jojic, Bikram Boote, Bolin Lai, Fiona Ryan, Houze Yang, James M. Rehg, Junho Kim, Sangmin Lee, Xu Cao.

**Figure 1.** Example from GRASP. Multi-person social reasoning requires grounding subtle non-verbal cues in the correct participants over time. Existing MLLMs [19, 78] often take spurious scene-level shortcuts, whereas ours leverage evidence-aware supervision to reason from the relevant social event.

**Figure 2.** Overview of the GRASP construction pipeline and QA examples. We convert multi-person videos into person-consistent gaze and gesture events, compose them into structured social QA pairs, and apply subset validation with human feedback for quality control.

**Figure 3.** GRASP taxonomy and statistics. Social Reasoning QA Generation: QA pairs are generated from structured event metadata derived from gaze and gesture interactions using a closed-source model [24]. Each question is constructed by querying key attributes such as participant identities, temporal intervals, and interaction types, ensuring that answers are directly verifiable without exhaustive manual annotation…

**Figure 5.** Grounded participant precision–accuracy on GRASP-Bench across various reasoning baselines. Marker size reflects the average number of novel participants mentioned in the reasoning trace.

**Figure 6.** Human validation interface. Evaluators inspect each QA instance with the corresponding…

**Figure 7.** Dataset-level scale across the six source domains. We report the number of source videos, …

**Figure 8.** Distribution of retained gaze event types and gesture types. Gaze events are filtered at …

**Figure 9.** QA category distribution. The benchmark contains 16 categories: six gaze categories …

**Figure 10.** Data construction yield. The pipeline starts from source videos, detects raw gaze and …

**Figure 11.** Average gaze and gesture event density per video. Social Deduction Game has the highest …

**Figure 12.** Distribution of MCQ training examples by modality and difficulty. Open-ended examples …

**Figure 13.** Accuracy compared against average reasoning length for all baselines.

**Figure 14.** Error profile for SGR compared against difficulty of the questions for the base models. To understand the difficulty of the GRASP-Bench tasks, cases are grouped into buckets defined by how many base models were correct, and the accuracy of our models is computed per bucket.

**Figure 15.** Qualitative comparison on an easy GRASP-Bench example. Panel titles: VL-Rethinker-7B; Qwen3VL + SGR.

**Figure 16.** Qualitative comparison on a medium GRASP-Bench example.

**Figure 17.** Qualitative comparison on a hard GRASP-Bench example. Panel titles: Qwen3VL-8B; Qwen3VL + SGR.

**Figure 18.** Qualitative comparison on MMSI speaker target identification (STI).

**Figure 19.** Qualitative comparison on MMSI pronoun coreference resolution (PCR).

**Figure 20.** Qualitative comparison on MMSI mentioned-player prediction (MPP).

**Figure 21.** Qualitative comparison on TVQA+.

**Figure 22.** Failure cases on GRASP-Bench. We show two representative errors: an ambiguous deictic gesture where reaching and pointing cues are visually close, and a crowded gaze-reasoning case where multiple gaze events occur within the target interval.

**Figure 23.** Qualitative GRASP-Bench examples for gaze reasoning, covering T1–T6.

**Figure 24.** Qualitative GRASP-Bench examples for gesture reasoning, covering G1–G6.

**Figure 25.** Qualitative GRASP-Bench examples for joint gaze–gesture reasoning, covering J1–J4.

**Figure 26.** Prompt for deictic gesture annotation.

**Figure 27.** Prompt QA generation.

Original abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question–answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze–gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRASP introduces a dataset linking social QA to identity-consistent gaze and gesture events, but the gains from SGR rest on thin experimental reporting.

GRASP stands out for creating a dataset that directly ties high-level social questions to specific non-verbal events like gaze and deictic gestures in multi-person videos, plus a reward mechanism to train on those connections. The new part is the GRASP dataset with its 290K QA pairs over 749 hours of video and a 16-category taxonomy that includes joint gaze-gesture reasoning. They construct questions from identity-consistent trajectories and gestures, which goes beyond prior work that either isolates cues or stays at high-level QA. The Social Grounding Reward (SGR) uses these events as a learning signal to push models to identify participants in interactions. If the full results hold up, this could help MLLMs do better at social reasoning without losing general capabilities.

The abstract reports that SGR improves GRASP-Bench performance while keeping zero-shot results on other benchmarks, but it skips over baselines, splits, significance tests, and ablations. That leaves the main empirical claim thin on support. The construction method also raises a question about bias: starting from clear, trackable gaze and gesture events might select for easier cases and under-represent ambiguous or occluded interactions, so any gains could partly reflect that curation rather than broader improvements.

Readers focused on multimodal social video understanding or grounding in large models would get the most from this. The dataset and taxonomy offer a concrete way to evaluate and train for participant identification in social scenes. It deserves a serious referee because the scale and the connection between levels are substantive, even with the current gaps in experimental detail.

I would send this to peer review. The authors should expand the methods and results sections to address the missing comparisons and test for robustness on less curated data.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GRASP, a large-scale dataset containing 290K question-answer pairs derived from 46K multi-person videos (749 hours total), organized under a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning. It proposes the Social Grounding Reward (SGR) as a learning signal that leverages identity-consistent social events to encourage models to ground interactions by identifying participants. Experiments report that SGR improves performance on the introduced GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

Significance. If the empirical results are substantiated with rigorous controls, this work would provide a valuable large-scale resource and training mechanism for advancing multimodal large language models in fine-grained social reasoning over non-verbal cues, addressing a notable gap between isolated cue detection and high-level social QA.

major comments (2)

§5 (Experiments): The reported performance improvements from SGR on GRASP-Bench are presented without details on the specific baselines compared, the train/validation/test splits employed, ablation studies isolating the reward component, or statistical significance testing, which are required to establish the robustness of the central empirical claim.
§3.2 (Dataset Construction): The process of selecting videos based on identity-consistent gaze trajectories and deictic gestures, followed by composing QA pairs under the 16-category taxonomy, risks introducing selection bias toward clear, trackable interactions; this could inflate SGR gains on GRASP-Bench without ensuring generalization to ambiguous, occluded, or culturally diverse real-world scenes.

minor comments (2)

Abstract: Expand to name the specific related social video QA benchmarks used for the zero-shot evaluation to provide immediate context for the preservation claim.
§4 (Method): Clarify the exact formulation of the SGR loss or reward function, including any hyperparameters, to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, outlining how we will strengthen the presentation of experiments and dataset construction.

Referee: §5 (Experiments): The reported performance improvements from SGR on GRASP-Bench are presented without details on the specific baselines compared, the train/validation/test splits employed, ablation studies isolating the reward component, or statistical significance testing, which are required to establish the robustness of the central empirical claim.

Authors: We agree that the current experimental section would benefit from greater detail to substantiate the central claims. In the revised manuscript we will expand §5 to explicitly list the baseline models and methods, describe the train/validation/test splits used for GRASP-Bench, present ablation studies that isolate the contribution of the Social Grounding Reward, and report statistical significance testing (e.g., paired t-tests or bootstrap intervals) for the observed improvements. (Revision: yes)

Referee: §3.2 (Dataset Construction): The process of selecting videos based on identity-consistent gaze trajectories and deictic gestures, followed by composing QA pairs under the 16-category taxonomy, risks introducing selection bias toward clear, trackable interactions; this could inflate SGR gains on GRASP-Bench without ensuring generalization to ambiguous, occluded, or culturally diverse real-world scenes.

Authors: The emphasis on identity-consistent trajectories is deliberate: it enables reliable construction of QA pairs that link fine-grained non-verbal events to specific participants, which is the core motivation for both GRASP and SGR. We acknowledge that this design choice favors clearer interactions and may affect generalization. In the revision we will add an explicit limitations paragraph in §3.2 that discusses selection bias, ambiguous/occluded cases, and cultural diversity, together with qualitative examples illustrating the dataset's coverage. (Revision: partial)
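The bootstrap intervals mentioned in the first response could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name, the 0/1 per-question correctness encoding, and the default resample count are all hypothetical, and it assumes both models were scored on the same questions in the same order.

```python
import random


def paired_bootstrap_ci(
    baseline_correct: list[int],
    treated_correct: list[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Percentile bootstrap interval for the paired accuracy difference.

    Inputs are per-question 0/1 correctness flags for the baseline and for
    the SGR-trained model. Returns (observed_difference, lower, upper).
    """
    if len(baseline_correct) != len(treated_correct):
        raise ValueError("both models must be scored on the same questions")
    n = len(baseline_correct)
    if n == 0:
        raise ValueError("need at least one question")

    rng = random.Random(seed)
    # Resampling per-question differences keeps the pairing intact.
    diffs = [t - b for b, t in zip(baseline_correct, treated_correct)]
    observed = sum(diffs) / n

    resampled_means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

An interval that excludes zero would support the claimed GRASP-Bench gain. The same call on a zero-shot benchmark would test the preservation claim, where an interval centered near zero is the desired outcome.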

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical dataset construction (GRASP with 290K QA pairs from gaze/gesture events) and a reward signal (SGR) defined directly from those events to train models for social reasoning. No equations, parameter fits, or derivations are described that reduce a claimed prediction or result to the inputs by construction. Central claims rest on experimental performance lifts on GRASP-Bench and zero-shot retention elsewhere, which are falsifiable benchmarks rather than self-referential. No load-bearing self-citations or uniqueness theorems are invoked in the provided text to justify the method. The approach is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard assumptions from multimodal learning and video annotation; SGR is introduced as a new training signal without independent external validation.

axioms (1)

Domain assumption: Identity-consistent gaze trajectories and deictic gestures can be reliably extracted and used to generate social QA pairs.
Invoked in the description of how questions are built from events.
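To make the assumption concrete, the sketch below turns one identity-consistent event into a multiple-choice question whose answer can be checked directly against the event record. It is a template-based stand-in chosen for illustration and does not reproduce the paper's generation procedure: the `SocialEvent` fields, the question wording, and the distractor offsets are all hypothetical.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class SocialEvent:
    """One identity-consistent gaze or gesture event (hypothetical schema)."""
    kind: str        # e.g. "points at" or "looks at"
    actor: int       # person id of the initiator
    target: int      # person id of the recipient
    start_s: float   # event start time in seconds
    end_s: float     # event end time in seconds


def duration_question(event: SocialEvent,
                      distractor_offsets=(-2.0, 2.0, 4.0)) -> dict:
    """Build a duration MCQ whose answer is verifiable from the event itself."""
    duration = event.end_s - event.start_s
    candidates = {duration}
    for offset in distractor_offsets:
        alternative = duration + offset
        if alternative > 0:          # skip non-physical durations
            candidates.add(alternative)
    ordered = sorted(candidates)
    labels = string.ascii_uppercase
    question = (
        f"Person {event.actor} {event.kind} Person {event.target} "
        f"starting at {event.start_s:.1f} s. How long does this event last?"
    )
    return {
        "question": question,
        "options": {labels[i]: f"{value:.1f} s" for i, value in enumerate(ordered)},
        "answer": labels[ordered.index(duration)],
        "participants": {event.actor, event.target},
    }


qa = duration_question(SocialEvent("points at", 3, 4, 1.0, 7.0))
print(qa["question"], qa["options"], qa["answer"])
```

Because the answer and the participant set both come from the same event record, an error in the upstream tracking or gesture detection propagates silently into the QA pair, which is why the reliability of extraction is load-bearing.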

invented entities (1)

Social Grounding Reward (SGR): no independent evidence
purpose: Learning signal that uses social events to encourage models to reason about interaction participants
Newly proposed in this work to train on the GRASP dataset.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events... Social Grounding Reward (SGR) ... verifies whether the model’s reasoning references the correct participants

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 16 internal anchors

[1] Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025.
[2] Anthropic. System card: Claude sonnet 4.6. https://www.anthropic.com/claude-haiku-4-5-system-card, Feb. 2026. Official system card.
[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.
[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[5] Jun Bao, Buyu Liu, and Jun Yu. Escnet: Gaze target detection with the understanding of 3d scenes. In CVPR, pages 14126–14135, 2022.
[6] Tonko EW Bossen, Andreas Møgelmose, and Ross Greer. Can vision-language models understand and interpret dynamic gestures from pedestrians? pilot datasets and exploration towards instructive nonverbal commands for cooperative autonomous vehicles. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4779–4788, 2025.
[7] Xu Cao, Pranav Virupaksha, Wenqi Jia, Bolin Lai, Fiona Ryan, Sangmin Lee, and James M Rehg. Socialgesture: Delving into multi-person gesture understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19509–19519, 2025.
[8] Xu Cao, Pranav Virupaksha, Sangmin Lee, Bolin Lai, Wenqi Jia, Jintai Chen, and James Matthew Rehg. Toward human deictic gesture target estimation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[9] Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, and James M Rehg. Gaze target estimation anywhere with concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026.
[10] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.
[11] Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng. Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens. arXiv preprint arXiv:2602.13517, 2026.
[12] Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025.
[13] Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M Rehg. Detecting attended visual targets in video. In CVPR, pages 5396–5406, 2020.
[14] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems, 2023.
[15] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020.
[16] Lifeng Fan, Yixin Chen, Ping Wei, Wenguan Wang, and Song-Chun Zhu. Inferring shared attention in social scene videos. In CVPR, pages 6460–6468, 2018.
[17] Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5724–5733, 2019.
[18] Yi Fang, Jiapeng Tang, Wang Shen, Wei Shen, Xiao Gu, Li Song, and Guangtao Zhai. Dual attention guided gaze target detection in the wild. In CVPR, pages 11390–11399, 2021.
[19] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025.
[20] Chris D Frith and Uta Frith. Mechanisms of social cognition. Annual review of psychology, 63:287–313, 2012.
[21] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025.
[22] Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding. arXiv preprint arXiv:2604.05015, 2026.
[23] Émilie Gagnon-St-Pierre, Marina M Doucerain, and Henry Markovits. Reasoning strategies explain individual differences in social reasoning. Journal of Experimental Psychology: General, 150(2):340, 2021.
[24] Google Deepmind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, Feb. 2026. Official system card.
[25] Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, and Jean-marc Odobez. Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction. Advances in Neural Information Processing Systems, 37:15646–15673, 2024.
[26] Anshul Gupta, Samy Tafasca, and Jean-Marc Odobez. A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings. In CVPRW, pages 5041–5050, 2022.
[27] Anshul Gupta, Pierre Vuillecard, Arya Farkhondeh, and Jean-Marc Odobez. Exploring the zero-shot capabilities of vision-language models for improving gaze following. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 615–624, 2024.
[28] Judith A Hall, Terrence G Horgan, and Nora A Murphy. Nonverbal communication. Annual review of psychology, 70(2019):271–294, 2019.
[29] Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025.
[30] Muhammet Ilaslan, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Lim, and Mike Shou. Gazevqa: A video question answering dataset for multiview eye-gaze task-oriented collaborations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10462–10479, 2023.
[31]

Depth-aware gaze-following via auxiliary networks for robotics.Engineering Applications of Artificial Intelligence, 113:104924, 2022

Tianlei Jin, Qizhi Yu, Shiqiang Zhu, Zheyuan Lin, Jie Ren, Yuanhai Zhou, and Wei Song. Depth-aware gaze-following via auxiliary networks for robotics.Engineering Applications of Artificial Intelligence, 113:104924, 2022

work page 2022
[32]

social gaze space

Mathis Jording, Arne Hartz, Gary Bente, Martin Schulte-Rüther, and Kai V ogeley. The “social gaze space”: A taxonomy for gaze-based communication in triadic interactions.Frontiers in psychology, 9:226, 2018

work page 2018
[33]

Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions.arXiv preprint arXiv:2511.16221, 2025

Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, and Yoichi Sato. Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions. arXiv preprint arXiv:2511.16221, 2025

work page arXiv 2025
[34]

Hagrid–hand gesture recognition image dataset

Alexander Kapitanov, Karina Kvanchiani, Alexander Nagaev, Roman Kraynov, and Andrei Makhliarchuk. Hagrid–hand gesture recognition image dataset. InWACV, pages 4572–4581, 2024

work page 2024
[35] Kobin H Kendrick, Judith Holler, and Stephen C Levinson. Turn-taking in human face-to-face interaction is multimodal: gaze direction and manual gestures aid the coordination of turn transitions. Philosophical transactions of the royal society B, 378(1875):20210473, 2023
[36] Junho Kim, Hyunjun Kim, Hosu Lee, and Yong Man Ro. Salova: Segment-augmented long video assistant for targeted retrieval and routing in long-form video analysis. arXiv preprint arXiv:2411.16173, 2024
[37] Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, and Xue Feng. SIV-Bench: A video benchmark for social interaction understanding and reasoning. arXiv preprint arXiv:2506.05425, 2025
[38] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955
[39] Bolin Lai, Hongxin Zhang, Miao Liu, Aryan Pariani, Fiona Ryan, Wenqi Jia, Shirley Anugrah Hayati, James Rehg, and Diyi Yang. Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games. In Findings of ACL, pages 6570–6588, 2023
[40] Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, and James M Rehg. Modeling multimodal social interactions: new challenges and baselines with densely aligned representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14585–14595, 2024
[41] Sangmin Lee, Minzhi Li, Bolin Lai, Wenqi Jia, Fiona Ryan, Xu Cao, Ozgur Kara, Bikram Boote, Weiyan Shi, Diyi Yang, et al. Towards social ai: A survey on understanding social interactions. arXiv preprint arXiv:2409.15316, 2024
[42] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. In EMNLP, 2018
[43] Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020
[44] Hengzhi Li, Megan Tjandrasuwita, Yi R Fung, Armando Solar-Lezama, and Paul Pu Liang. Mimeqa: Towards socially-intelligent nonverbal foundation models. arXiv preprint arXiv:2502.16671, 2025
[45] Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M Rehg, and Yapeng Tian. Towards online multi-modal social interaction understanding. arXiv preprint arXiv:2503.19851, 2025
[46] Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, and Yapeng Tian. Omni-mmsi: Toward identity-attributed social interaction understanding. arXiv preprint arXiv:2604.00267, 2026
[47] Yin Li, Miao Liu, and James M Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV, pages 619–635, 2018
[48] Zhuoming Li, Aitong Liu, Mengxi Jia, Yubo Lu, Tengxiang Zhang, Changzhi Sun, Dell Zhang, and Xuelong Li. Gestura: A lvlm-powered system bridging motion and semantics for real-time free-form gesture understanding. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(4):1–29, 2025
[49] Zongyu Lin, Zhikun Xu, Xiaohan Song, Yixin Wan, Xingcheng Yao, Tsung-Han Lin, Selina Song, Pranav Subbaraman, Ben Zhou, Kai-Wei Chang, et al. V-alphasocial: Benchmark and self-reflective chain-of-thought generation for visual social commonsense reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19025–19047, 2025
[50] Dan Liu, Libo Zhang, and Yanjun Wu. Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recognition. In CVPR, pages 3304–3312, 2022
[51] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023
[52] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023
[53] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025
[54] Manuel Marin-Jimenez, Andrew Zisserman, and Vittorio Ferrari. "Here’s looking at you, kid": Detecting people looking at each other in videos. In BMVC. British Machine Vision Association and Society for Pattern Recognition, 2011
[55] Athul M Mathew, Haithem Hermassi, Thariq Khalid, and Arshad Ali Khan. Gazevlm: A vision-language model for multi-task gaze understanding. arXiv preprint arXiv:2511.06348, 2025
[56] Leena Mathur, Marian Qian, Paul Pu Liang, and Louis-Philippe Morency. Social genome: Grounded social reasoning abilities of multimodal models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24879–24902, 2025
[57] Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset. arXiv preprint arXiv:2510.16258, 2025
[58] David McNeill. Hand and mind: What gestures reveal about thought. University of Chicago press, 1992
[59] Chris Moore, Philip J Dunham, and Phil Dunham. Joint attention: Its origins and role in development. Psychology Press, 2014
[60] Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, et al. See, hear, and understand: Benchmarking audiovisual human speech understanding in multimodal large language models. arXiv preprint arXiv:2512.02231, 2025
[61] Lixing Niu, Jiapeng Li, Xingping Yu, Xinyi Dong, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. Read the room: Video social reasoning with mental-physical causal chains. In The Fourteenth International Conference on Learning Representations, 2026
[62] Lixing Niu, Jiapeng Li, Xingping Yu, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. R^3-VQA: "Read the room" by video social reasoning. arXiv preprint arXiv:2505.04147, 2025
[63] OpenAI. GPT-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, Mar. 2026. Official system card
[64] Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, and Yoichi Sato. Multi-speaker attention alignment for multimodal social interaction. arXiv preprint arXiv:2511.17952, 2025
[65] Anupam Pani and Yanchao Yang. Gaze-vlm: Bridging gaze and vlms through attention regularization for egocentric understanding. arXiv preprint arXiv:2510.21356, 2025
[66] Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes. arXiv preprint arXiv:2505.23179, 2025
[67] Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025
[68] Adria Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. Where are they looking? NeurIPS, 28, 2015
[69] Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M Rehg. Gaze-lle: Gaze target estimation via large-scale learned encoders. 2025
[70] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
[71] Heung-Yeung Shum, Xiao-dong He, and Di Li. From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19:10–26, 2018
[72] Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, and Xiangmin Xu. Vitgaze: gaze following with interaction features in vision transformers. Visual Intelligence, 2(1):1–15, 2024
[73] Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127, 2025
[74] Hamza Tahboub, Weiyan Shi, Gang Hua, and Huaizu Jiang. Socialfusion: Addressing social degradation in pre-trained vision-language models. arXiv preprint arXiv:2512.01148, 2025
[75] Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026
[76] Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, and Louis-Philippe Morency. Social caption: Evaluating social understanding in multimodal models. arXiv preprint arXiv:2601.14569, 2026
[77] Michael Tomasello and Michael Jeffrey Farrar. Joint attention and early language. Child development, pages 1454–1463, 1986
[78] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025
[79] Shijing Wang, Chaoqun Cui, Yihua Cheng, and Yaping Huang. Gaze following in question answering: A comprehensive benchmark for vision-language models, 2025
[80] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

Showing first 80 references.

[1] [1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

System card: Claude sonnet 4.6

Anthropic. System card: Claude sonnet 4.6. https://www.anthropic.com/ claude-haiku-4-5-system-card, feb 2026. Official system card

work page 2026

[3] [3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Escnet: Gaze target detection with the understanding of 3d scenes

Jun Bao, Buyu Liu, and Jun Yu. Escnet: Gaze target detection with the understanding of 3d scenes. In CVPR, pages 14126–14135, 2022

work page 2022

[6] [6]

Tonko EW Bossen, Andreas Møgelmose, and Ross Greer. Can vision-language models understand and interpret dynamic gestures from pedestrians? pilot datasets and exploration towards instructive nonverbal commands for cooperative autonomous vehicles. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4779–4788, 2025

work page 2025

[7] [7]

Socialgesture: Delving into multi-person gesture understanding

Xu Cao, Pranav Virupaksha, Wenqi Jia, Bolin Lai, Fiona Ryan, Sangmin Lee, and James M Rehg. Socialgesture: Delving into multi-person gesture understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19509–19519, 2025

work page 2025

[8] [8]

Toward human deictic gesture target estimation

Xu Cao, Pranav Virupaksha, Sangmin Lee, Bolin Lai, Wenqi Jia, Jintai Chen, and James Matthew Rehg. Toward human deictic gesture target estimation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[9] [9]

Gaze target estimation anywhere with concepts

Xu Cao, Houze Yang, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, and James M Rehg. Gaze target estimation anywhere with concepts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

work page 2026

[10] [10]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens.arXiv preprint arXiv:2602.13517, 2026

Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng. Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens.arXiv preprint arXiv:2602.13517, 2026

work page arXiv 2026

[12] [12]

Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

work page arXiv 2025

[13] [13]

Detecting attended visual targets in video

Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M Rehg. Detecting attended visual targets in video. InCVPR, pages 5396–5406, 2020

work page 2020

[14] [14]

InstructBLIP: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. InAdvances in Neural Information Processing Systems, 2023

work page 2023

[15] [15]

Retinaface: Single-shot multi-level face localisation in the wild

Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020

work page 2020

[16] [16]

Inferring shared attention in social scene videos

Lifeng Fan, Yixin Chen, Ping Wei, Wenguan Wang, and Song-Chun Zhu. Inferring shared attention in social scene videos. InCVPR, pages 6460–6468, 2018

work page 2018

[17] [17]

Understanding human gaze communication by spatio-temporal graph reasoning

Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5724–5733, 2019

work page 2019

[18] [18]

Dual attention guided gaze target detection in the wild

Yi Fang, Jiapeng Tang, Wang Shen, Wei Shen, Xiao Gu, Li Song, and Guangtao Zhai. Dual attention guided gaze target detection in the wild. InCVPR, pages 11390–11399, 2021

work page 2021

[19] [19]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Mechanisms of social cognition.Annual review of psychology, 63:287–313, 2012

Chris D Frith and Uta Frith. Mechanisms of social cognition.Annual review of psychology, 63:287–313, 2012

work page 2012

[21] [21]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

work page 2025

[22] [22]

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Reasoning strategies explain individual differences in social reasoning.Journal of Experimental Psychology: General, 150(2):340, 2021

Émilie Gagnon-St-Pierre, Marina M Doucerain, and Henry Markovits. Reasoning strategies explain individual differences in social reasoning.Journal of Experimental Psychology: General, 150(2):340, 2021

work page 2021

[24] [24]

Gemini 3.1 pro model card

Google Deepmind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, feb 2026. Official system card

work page 2026

[25] [25]

Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction.Advances in Neural Information Processing Systems, 37:15646–15673, 2024

Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, and Jean-marc Odobez. Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction.Advances in Neural Information Processing Systems, 37:15646–15673, 2024

work page 2024

[26] [26]

A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings

Anshul Gupta, Samy Tafasca, and Jean-Marc Odobez. A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings. InCVPRW, pages 5041–5050, 2022

work page 2022

[27] [27]

Exploring the zero-shot capabilities of vision-language models for improving gaze following

Anshul Gupta, Pierre Vuillecard, Arya Farkhondeh, and Jean-Marc Odobez. Exploring the zero-shot capabilities of vision-language models for improving gaze following. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 615–624, 2024

work page 2024

[28] [28]

Nonverbal communication.Annual review of psychology, 70(2019):271–294, 2019

Judith A Hall, Terrence G Horgan, and Nora A Murphy. Nonverbal communication.Annual review of psychology, 70(2019):271–294, 2019

work page 2019

[29] [29]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Gazevqa: A video question answering dataset for multiview eye-gaze task-oriented collaborations

Muhammet Ilaslan, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Lim, and Mike Shou. Gazevqa: A video question answering dataset for multiview eye-gaze task-oriented collaborations. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10462–10479, 2023

work page 2023

[31] [31]

Depth-aware gaze-following via auxiliary networks for robotics.Engineering Applications of Artificial Intelligence, 113:104924, 2022

Tianlei Jin, Qizhi Yu, Shiqiang Zhu, Zheyuan Lin, Jie Ren, Yuanhai Zhou, and Wei Song. Depth-aware gaze-following via auxiliary networks for robotics.Engineering Applications of Artificial Intelligence, 113:104924, 2022

work page 2022

[32] [32]

social gaze space

Mathis Jording, Arne Hartz, Gary Bente, Martin Schulte-Rüther, and Kai V ogeley. The “social gaze space”: A taxonomy for gaze-based communication in triadic interactions.Frontiers in psychology, 9:226, 2018

work page 2018

[33] [33]

Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions.arXiv preprint arXiv:2511.16221, 2025

Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, and Yoichi Sato. Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions. arXiv preprint arXiv:2511.16221, 2025

work page arXiv 2025

[34] [34]

Hagrid–hand gesture recognition image dataset

Alexander Kapitanov, Karina Kvanchiani, Alexander Nagaev, Roman Kraynov, and Andrei Makhliarchuk. Hagrid–hand gesture recognition image dataset. InWACV, pages 4572–4581, 2024

work page 2024

[35] [35]

Kobin H Kendrick, Judith Holler, and Stephen C Levinson. Turn-taking in human face-to-face interaction is multimodal: gaze direction and manual gestures aid the coordination of turn transitions.Philosophical transactions of the royal society B, 378(1875):20210473, 2023

work page 2023

[36] [36]

Salova: Segment-augmented long video assistant for targeted retrieval and routing in long-form video analysis.arXiv preprint arXiv:2411.16173, 2024

Junho Kim, Hyunjun Kim, Hosu Lee, and Yong Man Ro. Salova: Segment-augmented long video assistant for targeted retrieval and routing in long-form video analysis.arXiv preprint arXiv:2411.16173, 2024

work page arXiv 2024

[37] [37]

SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, and Xue Feng. Siv-bench: A video benchmark for social interaction understanding and reasoning.arXiv preprint arXiv:2506.05425, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97, 1955

Harold W Kuhn. The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97, 1955

work page 1955

[39] [39]

Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games

Bolin Lai, Hongxin Zhang, Miao Liu, Aryan Pariani, Fiona Ryan, Wenqi Jia, Shirley Anugrah Hayati, James Rehg, and Diyi Yang. Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games. InFindings of ACL, pages 6570–6588, 2023

work page 2023

[40] [40]

Modeling multimodal social interactions: new challenges and baselines with densely aligned representations

Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, and James M Rehg. Modeling multimodal social interactions: new challenges and baselines with densely aligned representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14585–14595, 2024

work page 2024

[41] [41]

Towards social ai: A survey on understanding social interactions.arXiv preprint arXiv:2409.15316, 2024

Sangmin Lee, Minzhi Li, Bolin Lai, Wenqi Jia, Fiona Ryan, Xu Cao, Ozgur Kara, Bikram Boote, Weiyan Shi, Diyi Yang, et al. Towards social ai: A survey on understanding social interactions.arXiv preprint arXiv:2409.15316, 2024

work page arXiv 2024

[42] [42]

Tvqa: Localized, compositional video question answering

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. InEMNLP, 2018

work page 2018

[43] [43]

Tvqa+: Spatio-temporal grounding for video question answering

Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020

work page 2020

[44] [44]

Mimeqa: Towards socially-intelligent nonverbal foundation models.arXiv preprint arXiv:2502.16671, 2025

Hengzhi Li, Megan Tjandrasuwita, Yi R Fung, Armando Solar-Lezama, and Paul Pu Liang. Mimeqa: Towards socially-intelligent nonverbal foundation models.arXiv preprint arXiv:2502.16671, 2025. 11

work page arXiv 2025

[45] [45]

Towards online multi-modal social interaction understanding.arXiv preprint arXiv:2503.19851, 2025

Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M Rehg, and Yapeng Tian. Towards online multi-modal social interaction understanding.arXiv preprint arXiv:2503.19851, 2025

work page arXiv 2025

[46] [46]

Omni-mmsi: Toward identity-attributed social interaction understanding.arXiv preprint arXiv:2604.00267, 2026

Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, and Yapeng Tian. Omni-mmsi: Toward identity-attributed social interaction understanding.arXiv preprint arXiv:2604.00267, 2026

work page arXiv 2026

[47] [47]

In the eye of beholder: Joint learning of gaze and actions in first person video

Yin Li, Miao Liu, and James M Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. InECCV, pages 619–635, 2018

work page 2018

[48] [48]

Zhuoming Li, Aitong Liu, Mengxi Jia, Yubo Lu, Tengxiang Zhang, Changzhi Sun, Dell Zhang, and Xuelong Li. Gestura: A lvlm-powered system bridging motion and semantics for real-time free-form gesture understanding.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(4):1–29, 2025

work page 2025

[49] [49]

V-alphasocial: Benchmark and self-reflective chain-of-thought generation for visual social commonsense reasoning

Zongyu Lin, Zhikun Xu, Xiaohan Song, Yixin Wan, Xingcheng Yao, Tsung-Han Lin, Selina Song, Pranav Subbaraman, Ben Zhou, Kai-Wei Chang, et al. V-alphasocial: Benchmark and self-reflective chain-of-thought generation for visual social commonsense reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19025–19047, 2025

work page 2025

[50] [50]

Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recognition

Dan Liu, Libo Zhang, and Yanjun Wu. Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recognition. InCVPR, pages 3304–3312, 2022

work page 2022

[51] [51]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023

work page 2023

[53] [53]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

here’s looking at you, kid

Manuel Marin-Jimenez, Andrew Zisserman, and Vittorio Ferrari. " here’s looking at you, kid": Detecting people looking at each other in videos. InBMVC. British Machine Vision Association and Society for Pattern Recognition, 2011

work page 2011

[55] [55]

Gazevlm: A vision-language model for multi-task gaze understanding.arXiv preprint arXiv:2511.06348, 2025

Athul M Mathew, Haithem Hermassi, Thariq Khalid, and Arshad Ali Khan. Gazevlm: A vision-language model for multi-task gaze understanding.arXiv preprint arXiv:2511.06348, 2025

work page arXiv 2025

[56] Leena Mathur, Marian Qian, Paul Pu Liang, and Louis-Philippe Morency. Social genome: Grounded social reasoning abilities of multimodal models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24879–24902, 2025

[57] Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset. arXiv preprint arXiv:2510.16258, 2025

[58] David McNeill. Hand and mind: What gestures reveal about thought. University of Chicago Press, 1992

[59] Chris Moore, Philip J Dunham, and Phil Dunham. Joint attention: Its origins and role in development. Psychology Press, 2014

[60] Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, et al. See, hear, and understand: Benchmarking audiovisual human speech understanding in multimodal large language models. arXiv preprint arXiv:2512.02231, 2025

[61] Lixing Niu, Jiapeng Li, Xingping Yu, Xinyi Dong, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. Read the room: Video social reasoning with mental-physical causal chains. In The Fourteenth International Conference on Learning Representations, 2026

[62] Lixing Niu, Jiapeng Li, Xingping Yu, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, and Lifeng Fan. R^3-vqa: "read the room" by video social reasoning. arXiv preprint arXiv:2505.04147, 2025

[63] OpenAI. Gpt-5.4 thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/, Mar 2026. Official system card

[64] Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, and Yoichi Sato. Multi-speaker attention alignment for multimodal social interaction. arXiv preprint arXiv:2511.17952, 2025

[65] Anupam Pani and Yanchao Yang. Gaze-vlm: Bridging gaze and vlms through attention regularization for egocentric understanding. arXiv preprint arXiv:2510.21356, 2025

[66] Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes. arXiv preprint arXiv:2505.23179, 2025

[67] Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025

[68] Adria Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. Where are they looking? NeurIPS, 28, 2015

[69] Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M Rehg. Gaze-lle: Gaze target estimation via large-scale learned encoders. 2025

[70] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[71] Heung-Yeung Shum, Xiao-dong He, and Di Li. From eliza to xiaoice: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19:10–26, 2018

[72] Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, and Xiangmin Xu. Vitgaze: gaze following with interaction features in vision transformers. Visual Intelligence, 2(1):1–15, 2024

[73] Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127, 2025

[74] Hamza Tahboub, Weiyan Shi, Gang Hua, and Huaizu Jiang. Socialfusion: Addressing social degradation in pre-trained vision-language models. arXiv preprint arXiv:2512.01148, 2025

[75] Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

[76] Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, and Louis-Philippe Morency. Social caption: Evaluating social understanding in multimodal models. arXiv preprint arXiv:2601.14569, 2026

[77] Michael Tomasello and Michael Jeffrey Farrar. Joint attention and early language. Child development, pages 1454–1463, 1986

[78] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

[79] Shijing Wang, Chaoqun Cui, Yihua Cheng, and Yaping Huang. Gaze following in question answering: A comprehensive benchmark for vision-language models, 2025

[80] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025