Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

arxiv: 2605.19859 · v1 · pith:5CRPHCHFnew · submitted 2026-05-19 · 💻 cs.CV

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Hengfei Wang , Anshul Gupta , Pierre Vuillecard , Jean-Marc Odobez This is my paper

Pith reviewed 2026-05-20 05:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision language modelsgaze followingsocial gaze predictionbenchmark evaluationhuman attention understandingzero-shot and fine-tuningmultimodal reasoningspatial grounding

0 comments p. Extension

pith:5CRPHCHF Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{5CRPHCHF}

Prints a linked pith:5CRPHCHF badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Current vision-language models lack the precision needed for reliable gaze following and social gaze prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether vision-language models can accurately determine where a person is looking in a scene and reason about social attention behaviors like mutual gaze. It evaluates many models in zero-shot settings with varied prompts and after fine-tuning on gaze-related question-answer pairs. The work compares results against dedicated visual models using existing gaze datasets for two tasks: geometric gaze following that requires spatial scene understanding and social gaze prediction that leans on relational reasoning. Findings indicate that VLMs fall short on precise localization and interaction analysis. A reader would care because better gaze understanding could let multimodal AI systems support applications that depend on reading human attention and social cues.

Core claim

The paper establishes that vision-language models do not yet possess precise gaze understanding. In zero-shot evaluations across open- and closed-source models, performance on gaze following and social gaze prediction trails that of specialized visual models. Fine-tuning on task-specific QA data reduces the gap but leaves substantial shortfalls, particularly in spatial grounding and multi-person relational reasoning. The authors conclude that meaningful advances will require more than standard training approaches.

What carries the argument

The EyeVLM evaluation framework, which measures VLMs on gaze following (2D target localization from face and scene cues) and social gaze prediction (reasoning over interactions such as shared attention) through both prompting strategies and fine-tuning.

If this is right

Gaze-related tasks expose limits in how VLMs integrate geometric visual processing with semantic reasoning.
Scaling model size or data volume alone is unlikely to close the performance gap without targeted changes to spatial grounding.
Hybrid systems that pair VLMs with dedicated visual gaze modules could offer immediate practical gains.
Social gaze prediction may improve faster than pure geometric following because it can draw on language-based relational knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better gaze capabilities would likely transfer to adjacent problems such as action recognition and scene description that also depend on attention cues.
Extending the evaluation to video sequences could reveal whether current models handle dynamic gaze shifts over time.
The same benchmarking approach could be applied to other subtle human signals like gesture or posture to map broader VLM strengths and weaknesses.

Load-bearing premise

The chosen existing gaze datasets together with the prompting and fine-tuning methods give a fair and representative picture of how VLMs would handle gaze tasks in everyday conditions.

What would settle it

A controlled test in which one or more current VLMs, after standard fine-tuning, match or exceed the accuracy of top visual models on the same gaze following and social gaze benchmarks would undermine the central claim.

Figures

Figures reproduced from arXiv: 2605.19859 by Anshul Gupta, Hengfei Wang, Jean-Marc Odobez, Pierre Vuillecard.

**Figure 1.** Figure 1: EyeVLM overview. Left: Gaze Following (GFo) and Social Gaze Prediction (SG). Right: model comparison on GazeFollow [1] (GFo distance error ↓) and VideoAttentionTarget [2] (SG F1 ↑). Results are taken from [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Qualitative results for gaze following on the GazeFollow dataset. For each sample, the left [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VLMs still trail visual models on gaze following and social gaze tasks, with the main open question being whether the chosen datasets and prompts give them a fair shot.

read the letter

This paper benchmarks vision-language models on two gaze-related tasks and concludes they still fall short of specialized visual models. That's the core finding to take away. The new part is the EyeVLM framework. It evaluates VLMs on gaze following, which involves locating the target of someone's attention in an image, and social gaze prediction, which looks at interactions between people. They run this in zero-shot mode with various prompting strategies on both open and closed models, and also try fine-tuning with task-specific question-answer pairs. They compare against existing visual models using standard gaze datasets. What works here is the dual-task design. Gaze following tests the geometric and visual side, while social gaze taps into relational reasoning that might play to the language model's strengths. Testing both zero-shot and fine-tuned regimes, plus scaling effects, gives a broader picture than just one setting. Using public datasets for comparison is straightforward and lets them build on prior work without new data collection. The soft spot is in how representative the evaluation is. The results hinge on the specific datasets and the prompting or fine-tuning choices. If those don't capture enough variety in real-world scenes or don't push the models' semantic capabilities hard enough, the observed gaps could be partly an artifact of the setup rather than a fundamental limit. The abstract mentions systematic comparisons but doesn't flag any ablations on prompt variants or dataset coverage, which leaves that open. This kind of work is for researchers tracking how general multimodal models perform on human behavior tasks. People studying attention modeling or VLM applications in social scenes would find the comparisons useful. It has enough concrete evaluation to merit a serious referee, even if revisions are needed on the methods details. I would send this to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces EyeVLM, a systematic evaluation framework for assessing vision-language models (VLMs) on two gaze-related tasks: gaze following (predicting 2D look-at locations, emphasizing geometric and spatial reasoning) and social gaze prediction (reasoning about multi-person interactions like mutual gaze). It benchmarks a range of open- and closed-source VLMs in zero-shot settings with varied prompting and in a fine-tuning regime using task-specific QA pairs, comparing results against state-of-the-art purely visual models on existing public gaze datasets. The central claim is that current VLMs lack precise gaze understanding capabilities, though standard training narrows the gap to visual models without closing it.

Significance. If the evaluation holds, the work usefully documents current limitations of VLMs in a domain that combines visual grounding with social and relational reasoning, an area relevant to human behavior understanding, robotics, and HCI. Strengths include the dual-task design (geometric vs. social), the inclusion of both zero-shot and fine-tuning protocols, explicit comparison to visual baselines, and reliance on public datasets rather than new private ones. These elements provide a reproducible starting point for future VLM improvements in attention modeling.

major comments (2)

[§4 and §5] §4 (Evaluation Protocol) and §5 (Results): The central claim that VLMs 'lack precise gaze understanding capabilities' rests on performance gaps versus visual models, yet the manuscript provides no ablation or coverage analysis of the selected datasets' scene diversity (e.g., proportion of multi-person social contexts, indoor/outdoor balance, or head-pose variation). If the chosen datasets over-represent constrained lab settings, the observed shortfalls may reflect dataset properties rather than intrinsic VLM limits, directly affecting the generalizability stated in the abstract.
[§3.2] §3.2 (Prompting Strategies): The zero-shot evaluation uses a set of prompting strategies, but no systematic comparison (e.g., chain-of-thought variants or structured output formats for 2D coordinate extraction) is reported. This leaves open whether the reported VLM shortfalls could be mitigated by more effective exploitation of the language component, which is load-bearing for the claim that semantic reasoning in VLMs does not yet compensate for visual grounding weaknesses.

minor comments (2)

[Tables 1-2] Table 1 and Table 2: Ensure all reported metrics include standard deviations or confidence intervals across runs or seeds to allow readers to assess stability of the VLM vs. visual-model gaps.
[Figure 2] Figure 2 (qualitative examples): Add explicit annotations for ground-truth gaze vectors and model predictions to improve readability of failure cases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional analyses and experiments where they strengthen the work.

read point-by-point responses

Referee: [§4 and §5] §4 (Evaluation Protocol) and §5 (Results): The central claim that VLMs 'lack precise gaze understanding capabilities' rests on performance gaps versus visual models, yet the manuscript provides no ablation or coverage analysis of the selected datasets' scene diversity (e.g., proportion of multi-person social contexts, indoor/outdoor balance, or head-pose variation). If the chosen datasets over-represent constrained lab settings, the observed shortfalls may reflect dataset properties rather than intrinsic VLM limits, directly affecting the generalizability stated in the abstract.

Authors: We agree that explicit coverage analysis would better support claims of generalizability. The datasets are standard public benchmarks previously characterized in the gaze literature, but we have added a new analysis subsection to §4 reporting quantitative statistics on multi-person scenes, indoor/outdoor distribution, and head-pose variation. This shows a mix of lab and real-world settings with substantial non-lab content. We have updated the discussion and abstract to reference these statistics, confirming that the observed gaps are unlikely to be explained solely by dataset bias. revision: yes
Referee: [§3.2] §3.2 (Prompting Strategies): The zero-shot evaluation uses a set of prompting strategies, but no systematic comparison (e.g., chain-of-thought variants or structured output formats for 2D coordinate extraction) is reported. This leaves open whether the reported VLM shortfalls could be mitigated by more effective exploitation of the language component, which is load-bearing for the claim that semantic reasoning in VLMs does not yet compensate for visual grounding weaknesses.

Authors: Section 3.2 already evaluates multiple prompting variants (direct, descriptive, and few-shot). To provide the requested systematic comparison, we have added experiments with chain-of-thought prompting and structured JSON output formats for coordinate extraction. These results appear in the revised §5. The additional strategies produce modest gains, yet the performance gap relative to visual models persists. This supports the original conclusion that language-based reasoning does not yet fully offset visual grounding limitations in VLMs. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical VLM benchmarking

full rationale

This is an empirical evaluation paper that applies existing public gaze datasets to test VLMs in zero-shot and fine-tuned settings, then compares results against published visual-model baselines. No equations, parameter fits, derivations, or self-citations are invoked to support the central claims; all quantitative results derive from external benchmarks and standard prompting/fine-tuning protocols whose correctness can be independently reproduced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard computer-vision assumptions about dataset representativeness and the validity of current prompting techniques; no new free parameters or invented entities are introduced.

axioms (1)

domain assumption Existing gaze-following and social-gaze datasets are sufficiently representative of real-world attentional behavior for benchmarking purposes.
The evaluation framework directly uses these datasets to draw conclusions about VLM capabilities.

pith-pipeline@v0.9.0 · 5857 in / 1182 out tokens · 42151 ms · 2026-05-20T05:53:52.424698+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate both our tasks into a unified VLM-compatible visual question answering setting... Gaze Following... Social Gaze... prompting strategies (Chain-of-Thought, In-Context learning)... supervised fine-tuning... QA pairs
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 9 internal anchors

[1]

Where are they looking?

A. Recasens, A. Khosla, C. V ondrick, and A. Torralba, “Where are they looking?” in Advances in Neural Information Processing Systems (NeurIPS), 2015. [Online]. Available: https://papers.neurips.cc/paper/5848-where-are-they-looking.pdf

work page 2015
[2]

Detecting attended visual targets in video,

E. Chong, Y . Wang, N. Ruiz, and J. M. Rehg, “Detecting attended visual targets in video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5396–5406. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/papers/Chong_ Detecting_Attended_Visual_Targets_in_Video_CVPR_2020_paper.pdf

work page 2020
[3]

The eyes have it: The neuroethology, function and evolution of social gaze,

N. J. Emery, “The eyes have it: The neuroethology, function and evolution of social gaze,”Neuroscience & Biobehavioral Reviews, vol. 24, no. 6, pp. 581–604, 2000

work page 2000
[4]

Gaze- lle: Gaze target estimation via large-scale learned encoders,

F. Ryan, A. Bati, S. Lee, D. Bolya, J. Hoffman, and J. M. Rehg, “Gaze- lle: Gaze target estimation via large-scale learned encoders,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2025/papers/Ryan_Gaze-LLE_Gaze_Target_ Estimation_via_Large-Sc...

work page 2025
[5]

Gaze cueing of attention: Visual attention, social cognition, and individual differences,

A. Frischen, A. P. Bayliss, and S. P. Tipper, “Gaze cueing of attention: Visual attention, social cognition, and individual differences,”Psychological Bulletin, vol. 133, no. 4, pp. 694–724, 2007

work page 2007
[6]

Laeo-net++: Revisiting people looking at each other in videos,

M. J. Marin-Jimenez, V . Kalogeiton, P. Medina-Suarez, and A. Zisserman, “Laeo-net++: Revisiting people looking at each other in videos,”arXiv preprint arXiv:2101.02136, 2021. [Online]. Available: https://arxiv.org/abs/2101.02136

work page arXiv 2021
[7]

A unified model for gaze following and social gaze prediction,

A. Gupta, S. Tafasca, N. Chutisilp, and J.-M. Odobez, “A unified model for gaze following and social gaze prediction,” inIEEE International Conference on Automatic Face and Gesture Recognition (FG), 2024. [Online]. Available: https://publications.idiap.ch/attachments/papers/2024/Gupta_FG_2024.pdf

work page 2024
[8]

Inferring shared attention in social scene videos,

L. Fan, W. Chen, P. Wei, S.-K. Yeung, T. Wong, and J. Xing, “Inferring shared attention in social scene videos,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

work page
[9]

Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Fan_Inferring_ Shared_Attention_CVPR_2018_paper.pdf

[Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Fan_Inferring_ Shared_Attention_CVPR_2018_paper.pdf

work page
[10]

Sharingan: A transformer architecture for multi-person gaze following,

S. Tafasca, A. Gupta, and J.-M. Odobez, “Sharingan: A transformer architecture for multi-person gaze following,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2024/papers/Tafasca_ Sharingan_A_Transformer_Architecture_for_Multi-Person_Gaze_Fol...

work page 2024
[11]

Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction,

A. Gupta, S. Tafasca, A. Farkhondeh, P. Vuillecard, and J.-M. Odobez, “Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024. [Online]. Available: https://arxiv.org/abs/2403.10511

work page arXiv 2024
[12]

Learning Transferable Visual Models From Natural Language Supervision

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProceedings of the International Conference on Machine Learning (ICML), 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inProceedings of the International Conference on Machine Learning (ICML), 2023. [Online]. Available: https://arxiv.org/abs/2301.12597

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Visual Instruction Tuning

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://arxiv.org/abs/2304.08485

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Language Is Not All You Need: Aligning Perception with Language Models

S. Huang, L. Dong, W. Wang, Y . Yang, F. Wang, F. Liu, Z. Chi, T. Zhang, Q. Li, F. Linet al., “Kosmos-1: Multimodal large language model in the wild,”arXiv preprint arXiv:2302.14045, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Social caption: Evaluating social understanding in multimodal models,

B. Thumu, L. Mathur, Y . Kebe, and L.-P. Morency, “Social caption: Evaluating social understanding in multimodal models,” 2026. [Online]. Available: https://arxiv.org/abs/2601.14569

work page arXiv 2026
[17]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Dual attention guided gaze target detection in the wild,

Y . Fang, J. Tang, W. Shen, W. Shen, X. Gu, L. Song, and G. Zhai, “Dual attention guided gaze target detection in the wild,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 390–11 399. 10

work page 2021
[19]

Believe it or not, we know what you are looking at!

D. Lian, Z. Yu, and S. Gao, “Believe it or not, we know what you are looking at!” inAsian Conference on Computer Vision. Springer, 2018, pp. 35–50

work page 2018
[20]

A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings,

A. Gupta, S. Tafasca, and J.-M. Odobez, “A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022, pp. 5041–5050

work page 2022
[21]

Depth-aware gaze-following via auxiliary networks for robotics,

T. Jin, Q. Yu, S. Zhu, Z. Lin, J. Ren, Y . Zhou, and W. Song, “Depth-aware gaze-following via auxiliary networks for robotics,”Engineering Applications of Artificial Intelligence, vol. 113, p. 104924, 2022

work page 2022
[22]

Childplay: A new benchmark for understanding children’s gaze behaviour,

S. Tafasca, A. Gupta, and J.-M. Odobez, “Childplay: A new benchmark for understanding children’s gaze behaviour,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 20 935–20 946

work page 2023
[23]

Patch-level gaze distribution prediction for gaze following,

Q. Miao, M. Hoai, and D. Samaras, “Patch-level gaze distribution prediction for gaze following,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 880–889. [Online]. Available: https://openaccess.thecvf.com/content/W ACV2023/papers/Miao_ Patch-Level_Gaze_Distribution_Prediction_for_Gaze_Following_W ACV_20...

work page 2023
[24]

Detecting people looking at each other in videos,

M. J. Marin-Jimenez, A. Zisserman, M. Eichner, and V . Ferrari, “Detecting people looking at each other in videos,”International Journal of Computer Vision (IJCV), 2013. [Online]. Available: https://www.robots.ox.ac.uk/~vgg/publications/2014/Marin14/marin14.pdf

work page 2013
[25]

Laeo-net: Re- visiting people looking at each other in videos,

M. J. Marin-Jimenez, V . Kalogeiton, P. Medina-Suarez, and A. Zisserman, “Laeo-net: Re- visiting people looking at each other in videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Avail- able: https://openaccess.thecvf.com/content_CVPR_2019/papers/Marin-Jimenez_LAEO-Net_Revisiting_ People_L...

work page 2019
[26]

Boosting image- based mutual gaze detection using pseudo 3d gaze,

B. Doosti, C.-H. Chen, R. Vemulapalli, X. Jia, Y . Zhu, and B. Green, “Boosting image- based mutual gaze detection using pseudo 3d gaze,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, pp. 1273–1281, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16215

work page 2021
[27]

Hhp-net: A light heteroscedastic neural network for head pose estimation with uncertainty,

G. Cantarini, F. F. Tomenotti, N. Noceti, and F. Odone, “Hhp-net: A light heteroscedastic neural network for head pose estimation with uncertainty,” 2021

work page 2021
[28]

Attention flow: End-to-end joint attention estimation,

Ö. Sümer, P. Gerjets, U. Trautwein, and E. Kasneci, “Attention flow: End-to-end joint attention estimation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020. [Online]. Available: https://openaccess.thecvf.com/content_W ACV_2020/papers/Sumer_Attention_Flow_ End-to-End_Joint_Attention_Estimation_W ACV_2020_paper.pdf

work page 2020
[29]

Exploring the zero-shot capabilities of vision- language models for improving gaze following,

A. Gupta, P. Vuillecard, A. Farkhondeh, and J.-M. Odobez, “Exploring the zero-shot capabilities of vision- language models for improving gaze following,” inProceedings of the ieee/cvf conference on computer vision and pattern recognition, 2024, pp. 615–624

work page 2024
[30]

Gazellm: a plug-and-play zero-shot llm reasoning framework for boosting gaze target detection,

Y . Yang and F. Lu, “Gazellm: a plug-and-play zero-shot llm reasoning framework for boosting gaze target detection,”Visual Intelligence, vol. 3, no. 1, p. 26, 2025

work page 2025
[31]

Gazevlm: A vision-language model for multi-task gaze understanding,

A. M. Mathew, H. Hermassi, T. Khalid, and A. A. Khan, “Gazevlm: A vision-language model for multi-task gaze understanding,”arXiv preprint arXiv:2511.06348, 2025

work page arXiv 2025
[32]

Vl4gaze: Unleashing vision-language models for gaze following,

S. Wang, C. Cui, Y . Huang, H. J. Chang, and Y . Cheng, “Vl4gaze: Unleashing vision-language models for gaze following,”arXiv preprint arXiv:2512.20735, 2025

work page arXiv 2025
[33]

Chain- of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain- of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[34]

Large Language Models are Zero-Shot Reasoners

T. Kojima, S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” arXiv preprint arXiv:2205.11916, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Learn to explain: Multimodal reasoning via thought chains for science question answering,

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[36]

Multimodal chain-of-thought reasoning in language models,

Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”Transactions on Machine Learning Research (TMLR), 2024. 11

work page 2024
[37]

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Y . Wu, Z. Yang, J. Qian, S. Gao, G. Chen, Q. Li, Y .-A. Huang, and Z.-A. Huang, “Better eyes, better thoughts: Why vision chain-of-thought fails in medicine,”arXiv preprint arXiv:2603.06665, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Visual chain-of-thought prompting for knowledge-based visual reasoning,

Z.-F. Chenet al., “Visual chain-of-thought prompting for knowledge-based visual reasoning,” inProceed- ings of the AAAI Conference on Artificial Intelligence, 2024

work page 2024
[39]

Chain of thought prompt tuning in vision language models,

J. Ge, H. Luo, S. Qian, Y . Gan, J. Fu, and S. Zhang, “Chain of thought prompt tuning in vision language models,”arXiv preprint arXiv:2304.07919, 2023

work page arXiv 2023
[40]

Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning,

B. Ji, S. Agrawal, Q. Tang, and Y . Wu, “Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning,”arXiv preprint arXiv:2507.13362, 2025

work page arXiv 2025
[41]

mtgs-static-vsgaze,

A. Gupta, “mtgs-static-vsgaze,” 2026, accessed: 2026-05-06. [Online]. Available: https: //huggingface.co/Idiap/mtgs-static-vsgaze

work page 2026
[42]

Multimodal across domains gaze target detection,

F. Tonini, C. Beyan, and E. Ricci, “Multimodal across domains gaze target detection,” inProceedings of the 2022 International Conference on Multimodal Interaction, 2022, pp. 420–431

work page 2022
[43]

Gaze target estimation inspired by interactive attention,

Z. Hu, K. Zhao, B. Zhou, H. Guo, S. Wu, Y . Yang, and J. Liu, “Gaze target estimation inspired by interactive attention,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8524–8536, 2022

work page 2022
[44]

Where are they looking in the 3d space?

N. Horanyi, L. Zheng, E. Chong, A. Leonardis, and H. J. Chang, “Where are they looking in the 3d space?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 2678–2687

work page 2023
[45]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shaoet al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,”arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Gpt-5.4 thinking system card,

OpenAI, “Gpt-5.4 thinking system card,” https://openai.com/index/gpt-5-4-thinking-system-card/, 2026

work page 2026
[47]

Gemini 3.1 pro model card,

Google DeepMind, “Gemini 3.1 pro model card,” https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

work page 2026
[48]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Feng, and Y . Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association for Computational Linguistics, 2024. [Online]. Available: http...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Efficient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[50]

Austism diagnostic observation schedule: A standardized observation of communicative and social behavior,

C. Lord, M. Rutter, S. Goode, J. Heemsbergen, H. Jordan, L. Mawhood, and E. Schopler, “Austism diagnostic observation schedule: A standardized observation of communicative and social behavior,” Journal of autism and developmental disorders, vol. 19, no. 2, pp. 185–212, 1989

work page 1989
[51]

Autism Diagnostic Observation Schedule, Second Edition,

Lord, Rutter, DiLavore, Risi, Gotham, and Bishop, “Autism Diagnostic Observation Schedule, Second Edition,” 2012. 12 A Implementation Details We fine-tune representative Qwen-based VLMs on the constructed QA pairs using LLaMA- Factory [47]. Specifically, LoRA adapters are inserted into both the self-attention projection layers (q_proj, k_proj, v_proj, and...

work page 2012
[53]

Identify the activity or activities in which the person of interest is engaged in

work page
[54]

Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

Determine the direction of gaze of the person of interest. Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

work page
[55]

Evaluate as well whether the person of interest is looking at something that is within the image area, or outside the image area

Identify and locate all plausible gaze targets (objects and/or persons) in the image that are visible to the person of interest, that is, that are within their field-of-view. Evaluate as well whether the person of interest is looking at something that is within the image area, or outside the image area

work page
[56]

Describe the spatial relationships between the person of interest and each of these plausible gaze targets

work page
[57]

Based on the scene context and information above, estimate how possible the person of interest is looking inside the frame and which plausible gaze target is the person of interest most likely looking at regardless of whether it is inside or outside the frame

work page
[58]

inout": <0.0-1.0>,

Using your reasoning, localize the most probable target of attention of the person of interest as a point. Use this point as the final gaze point. Provide the requested outputs (inout and gaze_point). ### Input/Output Format: - **Input:** An image followed by a set of **bounding box coordinates** of the head of the person of interest. - **Output:** Report...

work page
[59]

The person of interest is the woman on the left, identified by the provided head bounding box

The image shows a café or casual dining setting with two people sitting at a table. The person of interest is the woman on the left, identified by the provided head bounding box

work page
[60]

Her head and upper body are turned toward the right side of the image, which indicates that her attention is directed toward the man beside her or toward an object near him

work page
[61]

This makes the interaction area on the right side of the image the most likely region of attention

Her visible facial orientation suggests a rightward gaze rather than a downward look at the table or an upward look toward the background shelves. This makes the interaction area on the right side of the image the most likely region of attention

work page
[62]

The cup is the stronger candidate because it is visually salient, close to her line of sight, and central to the man’s ongoing action, while his face is partly occluded by the cup

Within that rightward region, the most plausible gaze targets are the man’s face and the large white cup he is holding. The cup is the stronger candidate because it is visually salient, close to her line of sight, and central to the man’s ongoing action, while his face is partly occluded by the cup. By contrast, other objects such as the glass on the tabl...

work page
[63]

inout": 1.0,

Therefore, the most likely gaze target is the large white cup on the right side of the image. Since this target is clearly inside the frame, the probability of looking inside the image should be high. ### Gaze Point ### ```json [ {"inout": 1.0, "gaze_point": [0.81, 0.24]} ] ``` user: <image>[0.333, 0.056, 0.498, 0.221] 20 assistant: ### Reasoning ###

work page
[64]

The person of interest is the player in red at the center, identified by the given head bounding box

The image shows a field sport scene with several players. The person of interest is the player in red at the center, identified by the given head bounding box

work page
[65]

His face orientation suggests that his attention is not on the ground directly below him or on the distant players in the background

His body is moving forward, but his head is turned toward the left side of the image. His face orientation suggests that his attention is not on the ground directly below him or on the distant players in the background

work page
[66]

It is close to him, in front of his hurley, and consistent with the ongoing action

The most visually salient object in the direction of his head and eyes is the small ball on the left side of the image. It is close to him, in front of his hurley, and consistent with the ongoing action

work page
[67]

The ball is much more consistent with both the sports context and his immediate action

Other possible targets, such as the nearby teammate behind him or the player in blue on the right, are less likely because they are not aligned with his apparent line of sight. The ball is much more consistent with both the sports context and his immediate action

work page
[68]

The ball is clearly inside the frame, so the probability that he is looking inside the image is very high

work page
[69]

inout": 1.0,

Therefore, the most plausible gaze target is the ball on the left side of the image, with the gaze point placed near its center. ### Gaze Point ### ```json [ {"inout": 1.0, "gaze_point": [0.14, 0.23]} ] ``` Example provided. Now, analyze the following image. K Zero-Shot Prompts for SG For simplicity, we use <Task description> to denote the description of ...

work page
[70]

Describe what is happening in the image

work page
[71]

Identify the activity or activities in which the people of interest are engaged in

work page
[72]

Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

Determine the direction of gaze of the people of interest. Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

work page
[73]

Identify and locate all plausible gaze targets (objects and/or persons) in the image that are visible to the people of interest, that is, that are within their field-of-view

work page
[74]

Describe the spatial relationships between the people of interest and each of these plausible gaze targets

work page
[75]

Based on the scene context and information above, determine whether the people of interest are engaged in social gaze

work page
[76]

### Input/Output Format: - **Input:** An image followed by a pair of **bounding box coordinates** of the heads of the people of interest

Using your reasoning, estimate the probability of <task> for the people of interest. ### Input/Output Format: - **Input:** An image followed by a pair of **bounding box coordinates** of the heads of the people of interest. - **Output:** Report in JSON format: - **label** the probability of <task>. **Required Output Format:** ### Reasoning ### <Your step-b...

work page
[77]

The two people are a woman in the center and a child to her lower left, sitting together on a bed with others around them

work page
[78]

They appear to be interacting in a group conversation or family scene

work page
[79]

The child’s head is tilted upward toward the woman, so the child is looking at the woman’s face/head

work page
[80]

The woman’s head is turned slightly down and left toward the child, indicating her gaze is directed at the child

work page
[81]

Other plausible gaze targets exist in the room, but both people’s faces are oriented toward each other more than toward anyone else

work page

Showing first 80 references.

[1] [1]

Where are they looking?

A. Recasens, A. Khosla, C. V ondrick, and A. Torralba, “Where are they looking?” in Advances in Neural Information Processing Systems (NeurIPS), 2015. [Online]. Available: https://papers.neurips.cc/paper/5848-where-are-they-looking.pdf

work page 2015

[2] [2]

Detecting attended visual targets in video,

E. Chong, Y . Wang, N. Ruiz, and J. M. Rehg, “Detecting attended visual targets in video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5396–5406. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/papers/Chong_ Detecting_Attended_Visual_Targets_in_Video_CVPR_2020_paper.pdf

work page 2020

[3] [3]

The eyes have it: The neuroethology, function and evolution of social gaze,

N. J. Emery, “The eyes have it: The neuroethology, function and evolution of social gaze,”Neuroscience & Biobehavioral Reviews, vol. 24, no. 6, pp. 581–604, 2000

work page 2000

[4] [4]

Gaze- lle: Gaze target estimation via large-scale learned encoders,

F. Ryan, A. Bati, S. Lee, D. Bolya, J. Hoffman, and J. M. Rehg, “Gaze- lle: Gaze target estimation via large-scale learned encoders,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2025/papers/Ryan_Gaze-LLE_Gaze_Target_ Estimation_via_Large-Sc...

work page 2025

[5] [5]

Gaze cueing of attention: Visual attention, social cognition, and individual differences,

A. Frischen, A. P. Bayliss, and S. P. Tipper, “Gaze cueing of attention: Visual attention, social cognition, and individual differences,”Psychological Bulletin, vol. 133, no. 4, pp. 694–724, 2007

work page 2007

[6] [6]

Laeo-net++: Revisiting people looking at each other in videos,

M. J. Marin-Jimenez, V . Kalogeiton, P. Medina-Suarez, and A. Zisserman, “Laeo-net++: Revisiting people looking at each other in videos,”arXiv preprint arXiv:2101.02136, 2021. [Online]. Available: https://arxiv.org/abs/2101.02136

work page arXiv 2021

[7] [7]

A unified model for gaze following and social gaze prediction,

A. Gupta, S. Tafasca, N. Chutisilp, and J.-M. Odobez, “A unified model for gaze following and social gaze prediction,” inIEEE International Conference on Automatic Face and Gesture Recognition (FG), 2024. [Online]. Available: https://publications.idiap.ch/attachments/papers/2024/Gupta_FG_2024.pdf

work page 2024

[8] [8]

Inferring shared attention in social scene videos,

L. Fan, W. Chen, P. Wei, S.-K. Yeung, T. Wong, and J. Xing, “Inferring shared attention in social scene videos,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

work page

[9] [9]

Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Fan_Inferring_ Shared_Attention_CVPR_2018_paper.pdf

[Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Fan_Inferring_ Shared_Attention_CVPR_2018_paper.pdf

work page

[10] [10]

Sharingan: A transformer architecture for multi-person gaze following,

S. Tafasca, A. Gupta, and J.-M. Odobez, “Sharingan: A transformer architecture for multi-person gaze following,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2024/papers/Tafasca_ Sharingan_A_Transformer_Architecture_for_Multi-Person_Gaze_Fol...

work page 2024

[11] [11]

Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction,

A. Gupta, S. Tafasca, A. Farkhondeh, P. Vuillecard, and J.-M. Odobez, “Mtgs: A novel framework for multi-person temporal gaze following and social gaze prediction,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024. [Online]. Available: https://arxiv.org/abs/2403.10511

work page arXiv 2024

[12] [12]

Learning Transferable Visual Models From Natural Language Supervision

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProceedings of the International Conference on Machine Learning (ICML), 2021. [Online]. Available: https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inProceedings of the International Conference on Machine Learning (ICML), 2023. [Online]. Available: https://arxiv.org/abs/2301.12597

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Visual Instruction Tuning

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://arxiv.org/abs/2304.08485

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Language Is Not All You Need: Aligning Perception with Language Models

S. Huang, L. Dong, W. Wang, Y . Yang, F. Wang, F. Liu, Z. Chi, T. Zhang, Q. Li, F. Linet al., “Kosmos-1: Multimodal large language model in the wild,”arXiv preprint arXiv:2302.14045, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Social caption: Evaluating social understanding in multimodal models,

B. Thumu, L. Mathur, Y . Kebe, and L.-P. Morency, “Social caption: Evaluating social understanding in multimodal models,” 2026. [Online]. Available: https://arxiv.org/abs/2601.14569

work page arXiv 2026

[17] [17]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Dual attention guided gaze target detection in the wild,

Y . Fang, J. Tang, W. Shen, W. Shen, X. Gu, L. Song, and G. Zhai, “Dual attention guided gaze target detection in the wild,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 390–11 399. 10

work page 2021

[19] [19]

Believe it or not, we know what you are looking at!

D. Lian, Z. Yu, and S. Gao, “Believe it or not, we know what you are looking at!” inAsian Conference on Computer Vision. Springer, 2018, pp. 35–50

work page 2018

[20] [20]

A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings,

A. Gupta, S. Tafasca, and J.-M. Odobez, “A modular multimodal architecture for gaze target prediction: Application to privacy-sensitive settings,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022, pp. 5041–5050

work page 2022

[21] [21]

Depth-aware gaze-following via auxiliary networks for robotics,

T. Jin, Q. Yu, S. Zhu, Z. Lin, J. Ren, Y . Zhou, and W. Song, “Depth-aware gaze-following via auxiliary networks for robotics,”Engineering Applications of Artificial Intelligence, vol. 113, p. 104924, 2022

work page 2022

[22] [22]

Childplay: A new benchmark for understanding children’s gaze behaviour,

S. Tafasca, A. Gupta, and J.-M. Odobez, “Childplay: A new benchmark for understanding children’s gaze behaviour,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 20 935–20 946

work page 2023

[23] [23]

Patch-level gaze distribution prediction for gaze following,

Q. Miao, M. Hoai, and D. Samaras, “Patch-level gaze distribution prediction for gaze following,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 880–889. [Online]. Available: https://openaccess.thecvf.com/content/W ACV2023/papers/Miao_ Patch-Level_Gaze_Distribution_Prediction_for_Gaze_Following_W ACV_20...

work page 2023

[24] [24]

Detecting people looking at each other in videos,

M. J. Marin-Jimenez, A. Zisserman, M. Eichner, and V . Ferrari, “Detecting people looking at each other in videos,”International Journal of Computer Vision (IJCV), 2013. [Online]. Available: https://www.robots.ox.ac.uk/~vgg/publications/2014/Marin14/marin14.pdf

work page 2013

[25] [25]

Laeo-net: Re- visiting people looking at each other in videos,

M. J. Marin-Jimenez, V . Kalogeiton, P. Medina-Suarez, and A. Zisserman, “Laeo-net: Re- visiting people looking at each other in videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [Online]. Avail- able: https://openaccess.thecvf.com/content_CVPR_2019/papers/Marin-Jimenez_LAEO-Net_Revisiting_ People_L...

work page 2019

[26] [26]

Boosting image- based mutual gaze detection using pseudo 3d gaze,

B. Doosti, C.-H. Chen, R. Vemulapalli, X. Jia, Y . Zhu, and B. Green, “Boosting image- based mutual gaze detection using pseudo 3d gaze,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, pp. 1273–1281, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16215

work page 2021

[27] [27]

Hhp-net: A light heteroscedastic neural network for head pose estimation with uncertainty,

G. Cantarini, F. F. Tomenotti, N. Noceti, and F. Odone, “Hhp-net: A light heteroscedastic neural network for head pose estimation with uncertainty,” 2021

work page 2021

[28] [28]

Attention flow: End-to-end joint attention estimation,

Ö. Sümer, P. Gerjets, U. Trautwein, and E. Kasneci, “Attention flow: End-to-end joint attention estimation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020. [Online]. Available: https://openaccess.thecvf.com/content_W ACV_2020/papers/Sumer_Attention_Flow_ End-to-End_Joint_Attention_Estimation_W ACV_2020_paper.pdf

work page 2020

[29] [29]

Exploring the zero-shot capabilities of vision- language models for improving gaze following,

A. Gupta, P. Vuillecard, A. Farkhondeh, and J.-M. Odobez, “Exploring the zero-shot capabilities of vision- language models for improving gaze following,” inProceedings of the ieee/cvf conference on computer vision and pattern recognition, 2024, pp. 615–624

work page 2024

[30] [30]

Gazellm: a plug-and-play zero-shot llm reasoning framework for boosting gaze target detection,

Y . Yang and F. Lu, “Gazellm: a plug-and-play zero-shot llm reasoning framework for boosting gaze target detection,”Visual Intelligence, vol. 3, no. 1, p. 26, 2025

work page 2025

[31] [31]

Gazevlm: A vision-language model for multi-task gaze understanding,

A. M. Mathew, H. Hermassi, T. Khalid, and A. A. Khan, “Gazevlm: A vision-language model for multi-task gaze understanding,”arXiv preprint arXiv:2511.06348, 2025

work page arXiv 2025

[32] [32]

Vl4gaze: Unleashing vision-language models for gaze following,

S. Wang, C. Cui, Y . Huang, H. J. Chang, and Y . Cheng, “Vl4gaze: Unleashing vision-language models for gaze following,”arXiv preprint arXiv:2512.20735, 2025

work page arXiv 2025

[33] [33]

Chain- of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain- of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[34] [34]

Large Language Models are Zero-Shot Reasoners

T. Kojima, S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” arXiv preprint arXiv:2205.11916, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[35] [35]

Learn to explain: Multimodal reasoning via thought chains for science question answering,

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[36] [36]

Multimodal chain-of-thought reasoning in language models,

Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”Transactions on Machine Learning Research (TMLR), 2024. 11

work page 2024

[37] [37]

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

Y . Wu, Z. Yang, J. Qian, S. Gao, G. Chen, Q. Li, Y .-A. Huang, and Z.-A. Huang, “Better eyes, better thoughts: Why vision chain-of-thought fails in medicine,”arXiv preprint arXiv:2603.06665, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [38]

Visual chain-of-thought prompting for knowledge-based visual reasoning,

Z.-F. Chenet al., “Visual chain-of-thought prompting for knowledge-based visual reasoning,” inProceed- ings of the AAAI Conference on Artificial Intelligence, 2024

work page 2024

[39] [39]

Chain of thought prompt tuning in vision language models,

J. Ge, H. Luo, S. Qian, Y . Gan, J. Fu, and S. Zhang, “Chain of thought prompt tuning in vision language models,”arXiv preprint arXiv:2304.07919, 2023

work page arXiv 2023

[40] [40]

Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning,

B. Ji, S. Agrawal, Q. Tang, and Y . Wu, “Enhancing spatial reasoning in vision-language models via chain-of-thought prompting and reinforcement learning,”arXiv preprint arXiv:2507.13362, 2025

work page arXiv 2025

[41] [41]

mtgs-static-vsgaze,

A. Gupta, “mtgs-static-vsgaze,” 2026, accessed: 2026-05-06. [Online]. Available: https: //huggingface.co/Idiap/mtgs-static-vsgaze

work page 2026

[42] [42]

Multimodal across domains gaze target detection,

F. Tonini, C. Beyan, and E. Ricci, “Multimodal across domains gaze target detection,” inProceedings of the 2022 International Conference on Multimodal Interaction, 2022, pp. 420–431

work page 2022

[43] [43]

Gaze target estimation inspired by interactive attention,

Z. Hu, K. Zhao, B. Zhou, H. Guo, S. Wu, Y . Yang, and J. Liu, “Gaze target estimation inspired by interactive attention,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8524–8536, 2022

work page 2022

[44] [44]

Where are they looking in the 3d space?

N. Horanyi, L. Zheng, E. Chong, A. Leonardis, and H. J. Chang, “Where are they looking in the 3d space?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023, pp. 2678–2687

work page 2023

[45] [45]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shaoet al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,”arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Gpt-5.4 thinking system card,

OpenAI, “Gpt-5.4 thinking system card,” https://openai.com/index/gpt-5-4-thinking-system-card/, 2026

work page 2026

[47] [47]

Gemini 3.1 pro model card,

Google DeepMind, “Gemini 3.1 pro model card,” https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

work page 2026

[48] [48]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Feng, and Y . Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Bangkok, Thailand: Association for Computational Linguistics, 2024. [Online]. Available: http...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Efficient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023

[50] [50]

Austism diagnostic observation schedule: A standardized observation of communicative and social behavior,

C. Lord, M. Rutter, S. Goode, J. Heemsbergen, H. Jordan, L. Mawhood, and E. Schopler, “Austism diagnostic observation schedule: A standardized observation of communicative and social behavior,” Journal of autism and developmental disorders, vol. 19, no. 2, pp. 185–212, 1989

work page 1989

[51] [51]

Autism Diagnostic Observation Schedule, Second Edition,

Lord, Rutter, DiLavore, Risi, Gotham, and Bishop, “Autism Diagnostic Observation Schedule, Second Edition,” 2012. 12 A Implementation Details We fine-tune representative Qwen-based VLMs on the constructed QA pairs using LLaMA- Factory [47]. Specifically, LoRA adapters are inserted into both the self-attention projection layers (q_proj, k_proj, v_proj, and...

work page 2012

[52] [53]

Identify the activity or activities in which the person of interest is engaged in

work page

[53] [54]

Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

Determine the direction of gaze of the person of interest. Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

work page

[54] [55]

Evaluate as well whether the person of interest is looking at something that is within the image area, or outside the image area

Identify and locate all plausible gaze targets (objects and/or persons) in the image that are visible to the person of interest, that is, that are within their field-of-view. Evaluate as well whether the person of interest is looking at something that is within the image area, or outside the image area

work page

[55] [56]

Describe the spatial relationships between the person of interest and each of these plausible gaze targets

work page

[56] [57]

Based on the scene context and information above, estimate how possible the person of interest is looking inside the frame and which plausible gaze target is the person of interest most likely looking at regardless of whether it is inside or outside the frame

work page

[57] [58]

inout": <0.0-1.0>,

Using your reasoning, localize the most probable target of attention of the person of interest as a point. Use this point as the final gaze point. Provide the requested outputs (inout and gaze_point). ### Input/Output Format: - **Input:** An image followed by a set of **bounding box coordinates** of the head of the person of interest. - **Output:** Report...

work page

[58] [59]

The person of interest is the woman on the left, identified by the provided head bounding box

The image shows a café or casual dining setting with two people sitting at a table. The person of interest is the woman on the left, identified by the provided head bounding box

work page

[59] [60]

Her head and upper body are turned toward the right side of the image, which indicates that her attention is directed toward the man beside her or toward an object near him

work page

[60] [61]

This makes the interaction area on the right side of the image the most likely region of attention

Her visible facial orientation suggests a rightward gaze rather than a downward look at the table or an upward look toward the background shelves. This makes the interaction area on the right side of the image the most likely region of attention

work page

[61] [62]

The cup is the stronger candidate because it is visually salient, close to her line of sight, and central to the man’s ongoing action, while his face is partly occluded by the cup

Within that rightward region, the most plausible gaze targets are the man’s face and the large white cup he is holding. The cup is the stronger candidate because it is visually salient, close to her line of sight, and central to the man’s ongoing action, while his face is partly occluded by the cup. By contrast, other objects such as the glass on the tabl...

work page

[62] [63]

inout": 1.0,

Therefore, the most likely gaze target is the large white cup on the right side of the image. Since this target is clearly inside the frame, the probability of looking inside the image should be high. ### Gaze Point ### ```json [ {"inout": 1.0, "gaze_point": [0.81, 0.24]} ] ``` user: <image>[0.333, 0.056, 0.498, 0.221] 20 assistant: ### Reasoning ###

work page

[63] [64]

The person of interest is the player in red at the center, identified by the given head bounding box

The image shows a field sport scene with several players. The person of interest is the player in red at the center, identified by the given head bounding box

work page

[64] [65]

His face orientation suggests that his attention is not on the ground directly below him or on the distant players in the background

His body is moving forward, but his head is turned toward the left side of the image. His face orientation suggests that his attention is not on the ground directly below him or on the distant players in the background

work page

[65] [66]

It is close to him, in front of his hurley, and consistent with the ongoing action

The most visually salient object in the direction of his head and eyes is the small ball on the left side of the image. It is close to him, in front of his hurley, and consistent with the ongoing action

work page

[66] [67]

The ball is much more consistent with both the sports context and his immediate action

Other possible targets, such as the nearby teammate behind him or the player in blue on the right, are less likely because they are not aligned with his apparent line of sight. The ball is much more consistent with both the sports context and his immediate action

work page

[67] [68]

The ball is clearly inside the frame, so the probability that he is looking inside the image is very high

work page

[68] [69]

inout": 1.0,

Therefore, the most plausible gaze target is the ball on the left side of the image, with the gaze point placed near its center. ### Gaze Point ### ```json [ {"inout": 1.0, "gaze_point": [0.14, 0.23]} ] ``` Example provided. Now, analyze the following image. K Zero-Shot Prompts for SG For simplicity, we use <Task description> to denote the description of ...

work page

[69] [70]

Describe what is happening in the image

work page

[70] [71]

Identify the activity or activities in which the people of interest are engaged in

work page

[71] [72]

Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

Determine the direction of gaze of the people of interest. Analyze the orientation of their head and eyes (if visible) to infer their gaze direction

work page

[72] [73]

Identify and locate all plausible gaze targets (objects and/or persons) in the image that are visible to the people of interest, that is, that are within their field-of-view

work page

[73] [74]

Describe the spatial relationships between the people of interest and each of these plausible gaze targets

work page

[74] [75]

Based on the scene context and information above, determine whether the people of interest are engaged in social gaze

work page

[75] [76]

### Input/Output Format: - **Input:** An image followed by a pair of **bounding box coordinates** of the heads of the people of interest

Using your reasoning, estimate the probability of <task> for the people of interest. ### Input/Output Format: - **Input:** An image followed by a pair of **bounding box coordinates** of the heads of the people of interest. - **Output:** Report in JSON format: - **label** the probability of <task>. **Required Output Format:** ### Reasoning ### <Your step-b...

work page

[76] [77]

The two people are a woman in the center and a child to her lower left, sitting together on a bed with others around them

work page

[77] [78]

They appear to be interacting in a group conversation or family scene

work page

[78] [79]

The child’s head is tilted upward toward the woman, so the child is looking at the woman’s face/head

work page

[79] [80]

The woman’s head is turned slightly down and left toward the child, indicating her gaze is directed at the child

work page

[80] [81]

Other plausible gaze targets exist in the room, but both people’s faces are oriented toward each other more than toward anyone else

work page