
arXiv: 2605.04770 · v1 · submitted 2026-05-06 · cs.CV · cs.HC · cs.LG · cs.RO

Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3

classification cs.CV · cs.HC · cs.LG · cs.RO
keywords gaze estimation · human-robot interaction · zero-shot robustness · benchmark dataset · data diversity · appearance-based methods · HRI conditions

The pith

All tested gaze estimation networks fail in at least one human-robot interaction condition, with data diversity proving the main source of robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Gaze4HRI dataset to test appearance-based gaze estimators under realistic human-robot conditions that existing benchmarks ignore. It shows every current method breaks down somewhere in the new test set, with steeply downward gaze directions creating a failure shared by all approaches. One exception stands out: the PureGaze model trained on the broad ETH-X-Gaze collection handles every other condition without collapse. The authors conclude that the variety of training images matters more for reliable zero-shot performance than the addition of transformers, temporal modeling, or other architectural refinements. This gives practitioners a concrete way to pick models likely to work when robots must track eyes during movement and changing light.

Core claim

Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement.

What carries the argument

The Gaze4HRI dataset and evaluation protocol, which systematically varies illumination, head-gaze conflicts, and the motion of both camera and gaze target across thousands of video sequences to measure zero-shot robustness.
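The paper excerpts on this page report robustness as mean angular error but do not spell out the formula. The following is a minimal sketch, assuming the usual definition: the angle between the predicted and the ground-truth 3D gaze vectors. The function name `angular_error_deg` is hypothetical and is not taken from the paper's code.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze vectors (need not be unit length).

    Assumption: this is the conventional angular-error metric, not a
    formula confirmed by the Gaze4HRI paper.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))
```

Averaging this value over the frames of a video would give a video-level mean, which is the quantity the subject-level statistics are built from.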

If this is right

  • Steeply downward gaze directions require targeted attention in future model design or data collection for HRI applications.
  • Training on large and varied gaze datasets delivers broader robustness than adding spatial-temporal layers or transformer blocks.
  • Self-adversarial training losses that purify gaze features can deliver further gains once a diverse base dataset is used.
  • Practitioners should prioritize models trained on diverse collections such as ETH-X-Gaze when selecting estimators for moving-camera robot settings.
  • Future benchmarks must include dynamic camera and target motion to avoid overestimating real-world performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If data diversity drives robustness, then deliberately expanding existing gaze collections with more extreme angles and lighting could remove the remaining universal failure mode.
  • The same benchmarking approach could be applied to other robotic vision tasks to check whether architectural complexity is similarly overvalued relative to data coverage.
  • Deployments involving downward camera angles relative to people would need either special handling or fallback strategies until the downward-gaze problem is solved.
  • The universal failure on steep downward looks may point to a shared limitation in how current networks process eye appearance when the iris and pupil are partially occluded by the eyelid.

Load-bearing premise

The variables covered by Gaze4HRI (illumination, head-gaze conflict, and camera and target motion) capture the most important real-world difficulties that gaze estimators will face when deployed on robots.

What would settle it

A new method that succeeds on steeply downward gazes within the Gaze4HRI videos while using a simple non-transformer architecture and a smaller training set would disprove the claim that data diversity is the dominant factor.

Figures

Figures reproduced from arXiv: 2605.04770 by Ali Görkem Küçük, Berk Sezer, Erol Şahin, Sinan Kalkan.

Figure 1: (a) We introduce Gaze4HRI, an extensive benchmark which includes high-quality gaze recordings for 50+ subjects, …
Figure 2: Experimental Setup. The user is instructed to look …
Figure 3: Eye-head calibration for gaze ground truth collection: …
Figure 4: The four setups used in our analysis.
    (Excerpt from the paper body captured alongside this caption) … the subject in this experiment. For each method and level, we compute subject-level mean angular error by averaging a subject's video-level means, as shown in Table III. Results: Method rankings by illumination. The subject-level mean angular errors are provided in Table III. Pairwise within-subject t-tests (Holm-corrected, α = 0.05) show that PureGaze (E) and GazeTR…
Figure 5: Gaze targets on the shared table, for the object …
Figure 6: Exp. 1: Illustration of different illumination levels.
Figure 7: Experiment 3: Samples for low (a) and high (b) levels …
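The Figure 4 excerpt describes two statistical steps: averaging each subject's video-level means into a subject-level mean angular error, and applying a Holm correction (α = 0.05) to pairwise within-subject t-tests. A minimal sketch of both steps follows, under stated assumptions: the function names and the input layout are hypothetical, and the p-values are taken as given rather than computed from a t-test.

```python
from collections import defaultdict
from statistics import mean

def subject_level_means(video_means):
    """Average video-level mean errors per subject.

    video_means: iterable of (subject_id, video_mean_error_deg) pairs
    (a hypothetical layout, not the paper's data format).
    """
    per_subject = defaultdict(list)
    for subject_id, err in video_means:
        per_subject[subject_id].append(err)
    return {s: mean(errs) for s, errs in per_subject.items()}

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is retained as well
    return reject
```

Averaging per subject first keeps subjects with many videos from dominating the comparison, and the step-down correction controls the family-wise error rate across all method pairs.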
Original abstract

While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at https://gazeforhri.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Gaze4HRI, a new large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) for zero-shot benchmarking of appearance-based 3D gaze estimation methods under HRI-relevant conditions including illumination variation, head-gaze conflict, and motion of camera and gaze targets in video. Evaluation of multiple state-of-the-art methods shows that all fail in at least one condition, with steeply downward gaze as a universal failure mode; PureGaze trained on the ETH-X-Gaze dataset is the only one resilient across other conditions. The authors conclude that training data diversity is the primary driver of zero-shot robustness, challenging the emphasis on complex spatial-temporal or Transformer architectures.

Significance. If the empirical findings hold after addressing confounds, the work provides a useful new benchmark focused on practical HRI deployment gaps and offers actionable guidance that data curation may yield higher returns than architectural elaboration for generalization. Public release of the dataset and code supports reproducibility and follow-on work in robotics.

major comments (1)
  1. [Abstract and Results/Discussion] The claim in the abstract that 'extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments' (and the related challenge to Transformer-based architectures) is not load-bearing without controlling for training data. The results highlight PureGaze trained on ETH-X-Gaze as uniquely resilient, but other evaluated methods are not reported with equivalent training on the same diverse corpus; this confounds data volume/diversity with model architecture and self-adversarial loss, preventing unambiguous attribution. A rephrasing of the conclusion or an ablation (e.g., reporting original training sets or cross-training) is needed in the Results and Discussion sections.
minor comments (1)
  1. [Dataset description] Clarify in the dataset section the precise distribution of frames across each HRI variable (illumination, head-gaze conflict, motions) and any post-collection filtering criteria to allow readers to assess coverage and potential selection effects.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and describe the revisions we will implement.

Point-by-point responses
  1. Referee: [Abstract and Results/Discussion] The claim in the abstract that 'extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments' (and the related challenge to Transformer-based architectures) is not load-bearing without controlling for training data. The results highlight PureGaze trained on ETH-X-Gaze as uniquely resilient, but other evaluated methods are not reported with equivalent training on the same diverse corpus; this confounds data volume/diversity with model architecture and self-adversarial loss, preventing unambiguous attribution. A rephrasing of the conclusion or an ablation (e.g., reporting original training sets or cross-training) is needed in the Results and Discussion sections.

    Authors: We agree that the current presentation risks confounding training data characteristics with model architecture. All methods were evaluated using their publicly released pre-trained weights (or standard training protocols from their original papers), which is the conventional zero-shot setup. ETH-X-Gaze is the largest and most diverse corpus among those represented, and PureGaze trained on it is the only model that remains robust outside the downward-gaze failure mode shared by all methods. This pattern is consistent with our broader claim that data diversity matters, yet we acknowledge the referee's point that a controlled cross-training study would be required for unambiguous causal attribution. We will therefore revise the abstract and the Results/Discussion sections to rephrase the conclusion more cautiously: the findings will be presented as an empirical observation that 'suggests training-data diversity plays a central role' rather than asserting it as the 'primary driver' without qualification. We will also add an explicit limitations paragraph noting the absence of cross-training ablations and the consequent need for future work to disentangle data and architecture effects. This revision addresses the concern without requiring new experiments beyond the scope of the current benchmark.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with independent external models and datasets

Full rationale

The paper introduces Gaze4HRI as a new evaluation benchmark and reports performance of pre-existing methods (including PureGaze trained on the external ETH-X-Gaze corpus) across HRI conditions. All central claims—universal failure on steeply-downward gaze, relative resilience of one configuration, and the inference that data diversity outweighs architectural complexity—are direct summaries of observed metrics on held-out test videos. No equations, fitted parameters, or self-referential definitions appear; the derivation chain consists of standard cross-dataset evaluation steps that do not reduce to quantities defined inside the paper itself. Self-citations, if present, are not load-bearing for the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical benchmarking study. It introduces no new mathematical axioms, free parameters, or invented physical entities; all claims rest on standard computer-vision evaluation practices and the representativeness of the collected videos.

pith-pipeline@v0.9.0 · 5633 in / 1157 out tokens · 24516 ms · 2026-05-08T16:55:11.269591+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

  1. A. A. Abdelrahman, T. Hempel, A. Khalifa, A. Al-Hamadi, and L. Dinges. L2cs-net: Fine-grained gaze estimation in unconstrained environments. In 2023 8th International Conference on Frontiers of Signal Processing (ICFSP), pages 98–102, 2023.
  2. H. Admoni and B. Scassellati. Social eye gaze in human-robot interaction: A review. Journal of Human-Robot Interaction, 6:25–63, 03 2017.
  3. H. Balim, S. Park, X. Wang, X. Zhang, and O. Hilliges. Efe: End-to-end frame-to-gaze estimation. pages 2688–2697, 06 2023.
  4. M. Cai, F. Lu, and Y. Sato. Generalizing hand segmentation in egocentric videos with uncertainty-guided model adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14392–14401, 2020.
  5. Y. Cheng, Y. Bao, and F. Lu. Puregaze: Purifying gaze feature for generalizable gaze estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  6. Y. Cheng and F. Lu. Gaze estimation using transformer. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 3341–3347, 2022.
  7. Y. Cheng, H. Wang, Y. Bao, and F. Lu. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Trans. Pattern Anal. Mach. Intell., 46(12):7509–7528, Dec. 2024.
  8. S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian. Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12455–12464, 2020.
  9. T. Fischer, H. J. Chang, and Y. Demiris. Rt-gene: Real-time eye gaze estimation in natural environments. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 339–357, Cham, 2018. Springer International Publishing.
  10. A. Fischer-Janzen, M. Zhang, N. Lan, M. R. Yuce, M. Abdel-Malek, N. V. Thakor, W. D. Hairston, J. S. Bayouth, and H. Huang. A scoping review of gaze and eye tracking–based control for assistive robotics. Frontiers in Robotics and AI, 10:1302450, 2024.
  11. K. A. Funes Mora, F. Monay, and J.-M. Odobez. Eyediap: a database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, ETRA ’14, pages 255–258, New York, NY, USA, 2014. Association for Computing Machinery.
  12. R. Ghosh, A. Dutta, and J. Matas. Automatic gaze analysis: A survey of deep learning based approaches. Computer Vision and Image Understanding, 214:103313, 2022.
  13. Y. Guan, Z. Chen, W. Zeng, Z.-G. Cao, and Y. Xiao. End-to-end video gaze estimation via capturing head-face-eye spatial-temporal interaction context. IEEE Signal Processing Letters, PP:1–5, 01 2023.
  14. Z. Guo, Z. Yuan, C. Zhang, W. Chi, Y. Ling, and S. Zhang. Domain adaptation gaze estimation by embedding with prediction consistency. In H. Ishikawa, C.-L. Liu, T. Pajdla, and J. Shi, editors, Computer Vision, pages 292–307, Cham, 2021. Springer International Publishing.
  15. L. Jianfeng and L. Shigang. Eye-model-based gaze estimation by rgb-d camera. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 606–610, 2014.
  16. M. Jording, A. Hartz, G. Bente, M. Schulte-Rüther, and K. Vogeley. The “social gaze space”: A taxonomy for gaze-based communication in triadic interactions. Frontiers in Psychology, 09:226, 02 2018.
  17. P. Kellnhofer, A. Recasens, S. Stent, W. Matusik, and A. Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6911–6920, 2019.
  18. K. Kompatsiari, V. Tikhanoff, F. Ciardo, G. Metta, and A. Wykowska. The importance of mutual gaze in human-robot interaction. In Social Robotics, pages 443–452. Springer, 2017.
  19. K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2176–2184, 2016.
  20. P. Lanillos, J. F. Ferreira, and J. Dias. A bayesian hierarchy for robust gaze estimation in human–robot interaction. International Journal of Approximate Reasoning, 87:1–22, 2017.
  21. L. Lin, Z. Wu, Y. Lu, Z. Chen, and W. Guo. Recent progress on eye-tracking and gaze estimation for ar/vr applications: A review. Electronics, 14(17):3352, 2025.
  22. Y. Liu, R. Liu, H. Wang, and F. Lu. Generalizing gaze estimation with outlier-guided collaborative adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3835–3844, 2021.
  23. NaturalPoint, Inc. OptiTrack camera comparison and technical specifications, 2024.
  24. S. Panagou, W. P. Neumann, and F. Fruggiero. A scoping review of human robot interaction research towards industry 5.0 human-centric workplaces. International Journal of Production Research, 62(3):974–990, 2024.
  25. S. Pathirana, B. Shrestha, and J. Lee. Eye gaze estimation: A survey on deep learning-based approaches. IEEE Access, 10:99741–99764, 2022.
  26. D. Qi, W. Tan, Q. Yao, and J. Liu. Yolo5face: Why reinventing a face detector, 05 2021.
  27. T. Schreiter, T. Rodrigues de Almeida, Y. Zhu, E. Gutiérrez Maestro, L. Morillo-Méndez, A. Rudenko, L. Palmieri, T. Kucner, M. Magnusson, and A. Lilienthal. THÖR-MAGNI: A large-scale indoor motion capture recording of human movement and robot interaction. The International Journal of Robotics Research, 44, 10 2024.
  28. T. Schreiter, A. Rudenko, M. Magnusson, and A. J. Lilienthal. Human gaze and head rotation during navigation, exploration and object manipulation in shared environments with robots. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN), pages 1258–1265, 2024.
  29. P. K. Sharma and P. Chakraborty. A review of driver gaze estimation and application in gaze behavior understanding. Engineering Applications of Artificial Intelligence, 133:108117, 2024.
  30. Y. Sugano, Y. Matsushita, and Y. Sato. Learning-by-synthesis for appearance-based 3D gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1821–1828, 2014.
  31. C. Thepsoonthorn, K.-i. Ogawa, and Y. Miyake. The relationship between robot’s nonverbal behaviour and human’s likability based on human’s personality. Scientific Reports, 8, 05 2018.
  32. E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176, 2017.
  33. Universal Robots. UR5 Technical Specifications, 2024.
  34. P. Vuillecard and J.-M. Odobez. Enhancing 3d gaze estimation in the wild using weak supervision with gaze following labels. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13508–13518, 2025.
  35. X. Wang, J. Zhang, H. Zhang, S. Zhao, and H. Liu. Vision-based gaze estimation: A review. IEEE Transactions on Cognitive and Developmental Systems, 14(2):316–332, 2022.
  36. X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges. Eth-xgaze: A large-scale dataset for gaze estimation under extreme head pose and gaze variation. In Proc. European Conference on Computer Vision (ECCV), 2020.
  37. X. Zhang, Y. Sugano, and A. Bulling. Revisiting data normalization for appearance-based gaze estimation. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, ETRA ’18, pages 1–9, New York, NY, USA, 2018. Association for Computing Machinery.
  38. X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 41, pages 162–175, 2019.
  39. Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang. Bytetrack: Multi-object tracking by associating every detection box. In European Conference on Computer Vision (ECCV). Springer, 2022.

APPENDIX A. Ground-Truth Validation. The accuracy of our motion-capture system (OptiTrack) has been validated by tracking the UR5 end-effect...