arXiv: 2506.05412 · v3 · submitted 2025-06-04 · 💻 cs.CV · cs.CL

Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

Pith reviewed 2026-05-19 10:47 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords gaze direction · vision-language models · head orientation · eye appearance · nonverbal cues · performance gap · finetuning

The pith

Vision-language models rely on head orientation rather than eye appearance to infer gaze direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well vision-language models can tell where a person is looking in real photos. The authors made 1,360 pictures of someone at a table gazing at objects while deliberately varying whether the head pointed at the target, at a distractor, or was left free. Models fell well behind human performance, and the gap appeared mainly because the models read the head direction instead of the eyes. This matters for any AI meant to respond to people using natural nonverbal signals such as gaze. A finetuning test suggests the bias comes from the data the models were trained on rather than from their architecture.

Core claim

By photographing 1,360 scenes in which a person gazes at one of several table objects while head orientation is directed toward the target, toward a distractor, or left unconstrained, the work shows that vision-language models infer gaze direction primarily from head orientation rather than from eye appearance, producing errors exactly when the two cues conflict.

What carries the argument

A custom dataset of 1,360 real-world photos that varies head orientation independently of gaze target to isolate reliance on head pose versus eye appearance.
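A minimal sketch of how such a design can be scored, assuming hypothetical trial records (the field names, condition labels, and toy data below are illustrative and not taken from the paper). Accuracy is computed per head-orientation condition, and a head-following rate counts how often responses on conflict trials name the object the head points at.

```python
from collections import defaultdict

# Hypothetical trial records: field names and condition labels are
# illustrative assumptions, not the paper's actual data schema.
trials = [
    {"condition": "head_to_target", "gaze_target": "pen",
     "head_target": "pen", "response": "pen"},
    {"condition": "head_to_distractor", "gaze_target": "pen",
     "head_target": "doll", "response": "doll"},
    {"condition": "head_to_distractor", "gaze_target": "coffee",
     "head_target": "remote", "response": "coffee"},
    {"condition": "unconstrained", "gaze_target": "water",
     "head_target": None, "response": "water"},
]


def accuracy_by_condition(trials):
    """Fraction of trials per condition whose response matches the gaze target."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t in trials:
        counts[t["condition"]] += 1
        hits[t["condition"]] += t["response"] == t["gaze_target"]
    return {c: hits[c] / counts[c] for c in counts}


def head_following_rate(trials):
    """On conflict trials (head aimed at a distractor), the fraction of
    responses that name the object the head points at."""
    conflict = [t for t in trials if t["condition"] == "head_to_distractor"]
    if not conflict:
        return float("nan")
    followed = sum(t["response"] == t["head_target"] for t in conflict)
    return followed / len(conflict)


acc = accuracy_by_condition(trials)
rate = head_following_rate(trials)
```

With these toy records, accuracy drops only in the conflict condition, which is the pattern the core claim predicts for a model that follows the head rather than the eyes.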

If this is right

  • VLMs will err on gaze inference whenever head direction conflicts with the true eye direction.
  • The source of the error lies in the statistics of existing training data rather than in model architecture.
  • Targeted finetuning on data that highlights eye cues can begin to reduce the reliance on head orientation.
  • Technologies that interpret gaze targets will need to overcome this bias to support efficient, natural interactions with humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same head-pose shortcut may affect other fine-grained visual judgments that require distinguishing coarse body cues from detailed facial features.
  • Datasets containing deliberate mismatches between head pose and eye direction could be used to retrain models and measure whether the bias generalizes across architectures.
  • These results suggest a broader need for training data that emphasizes eye-level signals in any social-scene understanding task.

Load-bearing premise

The controlled variation in head orientation across the 1,360 photos isolates the contribution of head direction from eye appearance without introducing other uncontrolled visual cues that the models might exploit.

What would settle it

If a vision-language model given only the eye region (head masked or cropped) reaches human-level accuracy on the same gaze targets, this would indicate that eye appearance is sufficient and head orientation is not the dominant cue used.
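One way such an eyes-only stimulus could be prepared is sketched below, under stated assumptions: the image is a plain nested list of pixel values, and the eye bounding box comes from a hypothetical annotation. The helper `keep_only_region` is illustrative and not part of the paper's pipeline; a real experiment would more likely use an image library and a face-landmark detector.

```python
def keep_only_region(image, box, fill=0):
    """Return a copy of `image` with every pixel outside `box` set to `fill`.

    `image` is a list of rows of pixel values; `box` is
    (top, left, bottom, right) with bottom and right exclusive.
    """
    top, left, bottom, right = box
    return [
        [
            px if top <= r < bottom and left <= c < right else fill
            for c, px in enumerate(row)
        ]
        for r, row in enumerate(image)
    ]


# Toy 6x8 grayscale "photo"; the eye box is a hypothetical annotation.
photo = [[r * 8 + c for c in range(8)] for r in range(6)]
eyes_only = keep_only_region(photo, box=(1, 2, 3, 6))
```

Feeding `eyes_only` images to a model and comparing accuracy against the full-image condition would separate what the model can read from the eyes from what it reads from the head.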

Figures

Figures reproduced from arXiv: 2506.05412 by Bingyang Wang, Dezhi Luo, Hokin Deng, Pinyuan Feng, Qingying Gao, Suyang Yu, Tianwei Zhao, Yijiang Li, Ziqiao Ma, Zory Zhang.

Figure 1: (a) The gaze referential inference task. (b) 99.9% confidence intervals of the ...
Figure 2: Systematic manipulation of View (left/right/front), Proximity (1-3 scale), #Objects (2-4), Objects (18 combinations of 9 distinct items), and Gazer (2 actors) across 900 test stimuli. Stimuli in subfigure (c) have a Proximity value of 2. Here Gazer = Actor X.
Figure 3: A row in a confusion matrix indicates the proportion of trials across all 111 VLMs (or ...
Figure 4: No strong linear relation between VLM accuracy and release date was found.
Figure 5: The 95% confidence intervals for linear regression are drawn as shaded areas. Standard ...
Figure 6: The estimated marginal means. The random-guessing baseline is indicated by dashed lines.
Figure 7: Examples of stimuli for Actor Y with different ...
Figure 8: All 9 objects in the stimulus pool with different sizes and visual salience.
Figure 9: Number of stimuli nested within View, Proximity, and Objects. The brackets in row names denote whether the gazer is Actor X or Y for the combination of Objects.
Figure 10: The demographics.
Figure 11: The prediction made by a logistic regression model (response time shown here as the 99th ...
Figure 12: Screenshots from the human survey.
Figure 13: Full comparison of the overall accuracy of Humans and VLMs. 95% CIs are drawn ...
Figure 14: Confusion matrices. Each row corresponds to the ground truth object, and each column ...
Figure 15: Per-stimulus accuracy and distribution comparison between VLMs and human participants.
Figure 16: The distribution of VLM responses in terms of the options, combining trials in Analysis A ...
Figure 17: X axis represents stimuli. Blue cells are trials where the response is incorrect. Rows are ...
Figure 18: Each row depicts all the data points we collect from a human participant (who passed all ...
Figure 19: Each 3 by 3 matrix is a summary of a participant’s performance. The rows from top ...
Figure 20: VLM Accuracy nested within View, Proximity, and Objects. Based on responses from the five top-tier VLMs. Dashed lines represent the random-guessing baseline. 95% CI drawn.
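
The confusion matrices described in the captions above place ground-truth objects on rows and report proportions of trials. A minimal sketch of that row normalization follows, assuming responses arrive as hypothetical (ground truth, response) pairs; the function name and toy data are illustrative only.

```python
from collections import Counter


def row_normalized_confusion(pairs, labels):
    """Build a confusion matrix as nested dicts.

    Rows are ground-truth objects, columns are chosen objects, and each
    non-empty row sums to 1. Responses outside `labels` are ignored.
    """
    counts = Counter(pairs)
    matrix = {}
    for truth in labels:
        total = sum(counts[(truth, resp)] for resp in labels)
        matrix[truth] = {
            resp: (counts[(truth, resp)] / total if total else 0.0)
            for resp in labels
        }
    return matrix


# Toy (ground truth, response) pairs, invented for illustration.
pairs = [("pen", "pen"), ("pen", "doll"), ("doll", "doll"), ("doll", "doll")]
confusion = row_normalized_confusion(pairs, labels=["pen", "doll"])
```

Reading along a row shows where responses go when a given object is the true gaze target; mass off the diagonal marks systematic confusions.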
Original abstract

Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained. We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as resolution and object-naming skills, and identified the main reason for the gap as VLMs inferring gaze direction using head orientation rather than eye appearance. Such a bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision model. Future work should investigate whether these findings hold broadly across various deep learning methods trained on existing data, and whether better data mitigates this problem for all architectures. Pinpointing the reason sets the stage for technologies that can interpret gaze targets to have more efficient interactions with humans.
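
To make the notion of a performance gap relative to guessing concrete, the sketch below computes a chance level and a Wilson score interval for an accuracy estimate. It assumes uniform guessing over the objects on the table and uses invented counts; it is not the paper's statistical analysis, which may differ.

```python
import math


def chance_level(num_objects):
    """Accuracy expected from uniform random guessing among the objects."""
    return 1 / num_objects


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Invented counts for illustration: 450 correct responses out of 900 trials.
low, high = wilson_interval(450, 900)
baseline = chance_level(4)  # four objects on the table
```

An accuracy whose interval sits above the chance level but below the human interval would correspond to the kind of gap the abstract describes.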

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper evaluates Vision-Language Models (VLMs) on inferring gaze targets from 1,360 real-world photos of a person gazing at objects on a table. Head orientation is controlled across three conditions (toward target, toward distractor, unconstrained), and human baselines are collected on the same stimuli. The authors report a substantial VLM-human performance gap, rule out confounds such as resolution and object naming, and attribute the gap to VLMs relying on head orientation rather than eye appearance. A proof-of-concept finetuning experiment on a transformer vision model is included to suggest the bias is data-driven rather than architectural.

Significance. If the attribution to head-orientation reliance holds after tighter controls, the result is significant for human-AI interaction research: it isolates a concrete failure mode in processing nonverbal cues and points to data curation as a remedy. The use of real photos with explicit condition variation and human comparison strengthens the empirical contribution over purely synthetic benchmarks.

major comments (1)
  1. [Abstract / Stimulus Construction] Abstract and stimulus-construction description: the central claim that VLMs infer gaze from head orientation rather than eye appearance requires that the three head-orientation conditions vary head direction while holding eye appearance and all other visual features fixed. In real photographs, turning the head toward a distractor while directing gaze at the target necessarily alters body angle, facial shadowing, and the projected shape of the eyes relative to the camera. No measurements or additional controls confirming that remaining variance is carried only by the intended head-orientation cue are reported, weakening the causal attribution.
minor comments (1)
  1. [Abstract] The abstract states that resolution and object-naming confounds were ruled out but does not specify the exact tests, stimuli, or statistical thresholds used; adding these details would improve reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the empirical contribution of our real-world stimulus set and human baselines. We address the major comment on stimulus construction below, agreeing that additional quantification would strengthen the causal claims, and outline the planned revisions.

Point-by-point responses
  1. Referee: [Abstract / Stimulus Construction] Abstract and stimulus-construction description: the central claim that VLMs infer gaze from head orientation rather than eye appearance requires that the three head-orientation conditions vary head direction while holding eye appearance and all other visual features fixed. In real photographs, turning the head toward a distractor while directing gaze at the target necessarily alters body angle, facial shadowing, and the projected shape of the eyes relative to the camera. No measurements or additional controls confirming that remaining variance is carried only by the intended head-orientation cue are reported, weakening the causal attribution.

    Authors: We thank the referee for highlighting this important methodological point. In our stimulus construction, each photo shows a single gazer (one of two actors) in a fixed real-world setting with consistent camera position and lighting; the gazer was explicitly instructed to maintain fixation on the target object while head orientation was varied across the three conditions (toward target, toward distractor, unconstrained). This produces the key contrast between aligned and misaligned head-gaze configurations. We acknowledge that real photographs necessarily introduce some correlated changes in body angle, shadowing, and eye projection due to perspective. To address the concern directly, the revised manuscript will expand the methods section with a fuller description of the photography protocol (including posture standardization attempts) and will add quantitative measurements of eye-region consistency across conditions, for example via landmark-based similarity metrics or iris-position variance. These additions will help demonstrate that head orientation remains the primary manipulated factor while preserving the ecological validity of naturalistic images over fully synthetic controls.

    Revision: yes
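
The iris-position measurement mentioned in the rebuttal could look roughly like the sketch below. It assumes hypothetical 2D landmarks (x-coordinates of the eye corners and iris center) and summarizes a normalized iris position per condition; the landmark source, function names, and numbers are all illustrative assumptions, not the authors' protocol.

```python
from statistics import mean, pvariance


def iris_ratio(inner_corner_x, outer_corner_x, iris_x):
    """Horizontal iris position as a fraction of eye width
    (0 = inner corner, 1 = outer corner)."""
    width = outer_corner_x - inner_corner_x
    if width == 0:
        raise ValueError("eye corners coincide")
    return (iris_x - inner_corner_x) / width


def summarize_by_condition(measurements):
    """Map each condition to (mean, population variance) of its iris ratios."""
    grouped = {}
    for condition, ratio in measurements:
        grouped.setdefault(condition, []).append(ratio)
    return {c: (mean(v), pvariance(v)) for c, v in grouped.items()}


# Invented measurements: (condition label, normalized iris position).
measurements = [
    ("head_to_target", iris_ratio(100, 140, 120)),
    ("head_to_target", 0.5),
    ("head_to_distractor", 0.25),
    ("head_to_distractor", 0.75),
]
summary = summarize_by_condition(measurements)
```

Comparing these per-condition summaries would give a rough, quantitative view of how much the visible eye configuration changes when the head is turned away from the target.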

Circularity Check

0 steps flagged

No circularity: empirical measurement study with independent stimuli

full rationale

The paper is an empirical evaluation that collects a new set of 1,360 controlled photographs varying head orientation while measuring VLM performance on gaze target inference. No equations, fitted parameters, or derivations are present that reduce to self-referential inputs. The central claim rests on direct comparison of model outputs against human baselines using the newly constructed stimuli, with alternative explanations (resolution, object naming) explicitly ruled out by the experimental design. This is self-contained against external benchmarks and contains no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new mathematical axioms, free parameters, or invented entities. It relies on standard assumptions that the collected photographs are representative of real-world scenes and that human performance on the same stimuli provides a valid upper bound.

axioms (1)
  • Domain assumption: Photographs with controlled head orientation accurately isolate the visual cue of head direction from eye appearance.
    Invoked when attributing the performance gap specifically to head-orientation bias rather than other uncontrolled factors.

pith-pipeline@v0.9.0 · 5761 in / 1265 out tokens · 37573 ms · 2026-05-19T10:47:37.388821+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
