Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction
Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3
The pith
All tested gaze estimation networks fail in at least one human-robot interaction condition, with data diversity proving the main source of robustness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement.
What carries the argument
The Gaze4HRI dataset and evaluation protocol, which systematically varies illumination, head-gaze conflicts, and the motion of both camera and gaze target across thousands of video sequences to measure zero-shot robustness.
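A minimal sketch of how a per-condition zero-shot comparison of this kind could be tabulated. It assumes angular error in degrees between predicted and ground-truth 3D gaze vectors as the metric, which is common in gaze estimation but is not stated in the text above. The condition labels and the function names `angular_error_deg` and `mean_error_by_condition` are hypothetical and are not taken from the Gaze4HRI code.

```python
import math
from collections import defaultdict


def angular_error_deg(pred, truth):
    """Angle in degrees between two 3D gaze vectors (need not be unit length)."""
    dot = sum(p * t for p, t in zip(pred, truth))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in truth))
    if norm == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def mean_error_by_condition(samples):
    """Average angular error per condition.

    samples: iterable of (condition_label, predicted_vec, ground_truth_vec).
    The condition labels are assumed, e.g. "low_light" or "moving_camera".
    """
    errors = defaultdict(list)
    for condition, pred, truth in samples:
        errors[condition].append(angular_error_deg(pred, truth))
    return {c: sum(v) / len(v) for c, v in errors.items()}
```

Reporting one mean per condition, rather than a single pooled mean, is what lets a benchmark say that a method fails in a specific condition while holding up in the others.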
If this is right
- Steeply downward gaze directions require targeted attention in future model design or data collection for HRI applications.
- Training on large and varied gaze datasets delivers broader robustness than adding spatial-temporal layers or transformer blocks.
- Self-adversarial training losses that purify gaze features can deliver further gains once a diverse base dataset is used.
- Practitioners should prioritize models trained on diverse collections such as ETH-X-Gaze when selecting estimators for moving-camera robot settings.
- Future benchmarks must include dynamic camera and target motion to avoid overestimating real-world performance.
Where Pith is reading between the lines
- If data diversity drives robustness, then deliberately expanding existing gaze collections with more extreme angles and lighting could remove the remaining universal failure mode.
- The same benchmarking approach could be applied to other robotic vision tasks to check whether architectural complexity is similarly overvalued relative to data coverage.
- Deployments involving downward camera angles relative to people would need either special handling or fallback strategies until the downward-gaze problem is solved.
- The universal failure on steep downward looks may point to a shared limitation in how current networks process eye appearance when the iris and pupil are partially occluded by the eyelid.
Load-bearing premise
The specific variables captured in Gaze4HRI capture the most important real-world difficulties that gaze estimators will face when deployed on robots.
What would settle it
A new method that succeeds on steeply downward gazes within the Gaze4HRI videos while using a simple non-transformer architecture and a smaller training set would disprove the claim that data diversity is the dominant factor.
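To test a claim about steeply downward gaze, the evaluation has to separate those samples from the rest. The sketch below shows one way that split could be done. It assumes a camera frame whose y axis points down (image-style convention) and a cut-off of 30 degrees below horizontal; both are illustrative assumptions, since the text above does not define "steeply downward" numerically. The function names are hypothetical.

```python
import math


def downward_angle_deg(gaze, down_axis=(0.0, 1.0, 0.0)):
    """Angle of a gaze vector below the horizontal plane, in degrees.

    Positive values mean the gaze points downward. `down_axis` is an assumed
    unit vector for "down" in the camera frame (y-down here).
    """
    norm = math.sqrt(sum(g * g for g in gaze))
    if norm == 0.0:
        raise ValueError("gaze vector must be non-zero")
    sin_down = sum(g * d for g, d in zip(gaze, down_axis)) / norm
    sin_down = max(-1.0, min(1.0, sin_down))
    return math.degrees(math.asin(sin_down))


def split_by_downward_gaze(errors_with_gaze, threshold_deg=30.0):
    """Split (error_deg, ground_truth_gaze) pairs at a hypothetical threshold."""
    steep, other = [], []
    for error, gaze in errors_with_gaze:
        bucket = steep if downward_angle_deg(gaze) >= threshold_deg else other
        bucket.append(error)
    return steep, other
```

Comparing the error distributions of the two returned lists for a candidate method would show whether it avoids the shared failure mode, under whatever threshold the evaluator chooses.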
Original abstract
While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at https://gazeforhri.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gaze4HRI, a new large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) for zero-shot benchmarking of appearance-based 3D gaze estimation methods under HRI-relevant conditions including illumination variation, head-gaze conflict, and motion of camera and gaze targets in video. Evaluation of multiple state-of-the-art methods shows that all fail in at least one condition, with steeply downward gaze as a universal failure mode; PureGaze trained on the ETH-X-Gaze dataset is the only configuration that remains resilient across all other conditions. The authors conclude that training data diversity is the primary driver of zero-shot robustness, challenging the emphasis on complex spatial-temporal or Transformer architectures.
Significance. If the empirical findings hold after addressing confounds, the work provides a useful new benchmark focused on practical HRI deployment gaps and offers actionable guidance that data curation may yield higher returns than architectural elaboration for generalization. Public release of the dataset and code supports reproducibility and follow-on work in robotics.
major comments (1)
- [Abstract and Results/Discussion] The claim in the abstract that 'extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments' (and the related challenge to Transformer-based architectures) is not load-bearing without controlling for training data. The results highlight PureGaze trained on ETH-X-Gaze as uniquely resilient, but other evaluated methods are not reported with equivalent training on the same diverse corpus; this confounds data volume/diversity with model architecture and self-adversarial loss, preventing unambiguous attribution. A rephrasing of the conclusion or an ablation (e.g., reporting original training sets or cross-training) is needed in the Results and Discussion sections.
minor comments (1)
- [Dataset description] Clarify in the dataset section the precise distribution of frames across each HRI variable (illumination, head-gaze conflict, motions) and any post-collection filtering criteria to allow readers to assess coverage and potential selection effects.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and describe the revisions we will implement.
Point-by-point responses
Referee: [Abstract and Results/Discussion] The claim in the abstract that 'extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments' (and the related challenge to Transformer-based architectures) is not load-bearing without controlling for training data. The results highlight PureGaze trained on ETH-X-Gaze as uniquely resilient, but other evaluated methods are not reported with equivalent training on the same diverse corpus; this confounds data volume/diversity with model architecture and self-adversarial loss, preventing unambiguous attribution. A rephrasing of the conclusion or an ablation (e.g., reporting original training sets or cross-training) is needed in the Results and Discussion sections.
Authors: We agree that the current presentation risks confounding training data characteristics with model architecture. All methods were evaluated using their publicly released pre-trained weights (or standard training protocols from their original papers), which is the conventional zero-shot setup. ETH-X-Gaze is the largest and most diverse corpus among those represented, and PureGaze trained on it is the only model that remains robust outside the downward-gaze failure mode shared by all methods. This pattern is consistent with our broader claim that data diversity matters, yet we acknowledge the referee's point that a controlled cross-training study would be required for unambiguous causal attribution. We will therefore revise the abstract and the Results/Discussion sections to rephrase the conclusion more cautiously: the findings will be presented as an empirical observation that 'suggests training-data diversity plays a central role' rather than asserting it as the 'primary driver' without qualification. We will also add an explicit limitations paragraph noting the absence of cross-training ablations and the consequent need for future work to disentangle data and architecture effects. This revision addresses the concern without requiring new experiments beyond the scope of the current benchmark.
Revision: yes
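The confound discussed in this exchange can be made concrete by writing out the crossed design that a cross-training ablation would need. The sketch below enumerates every architecture and training-set pairing and flags which cells already have results. Only PureGaze and ETH-X-Gaze are named in the text; `arch_B`, `dataset_B`, and both function names are placeholders, and the set of reported cells is a hypothetical illustration, not the paper's actual coverage.

```python
from itertools import product


def crossed_design(architectures, training_sets, reported):
    """List every (architecture, training set) cell and whether it was reported.

    `reported` is a set of (architecture, training_set) pairs that already
    have results, for example from released pre-trained weights.
    """
    return [
        {"architecture": a, "training_set": d, "reported": (a, d) in reported}
        for a, d in product(architectures, training_sets)
    ]


def missing_cells(design):
    """Cells that would need new training runs to separate data from architecture."""
    return [(c["architecture"], c["training_set"]) for c in design if not c["reported"]]


# Hypothetical example: each architecture is available on one training set only.
architectures = ["PureGaze", "arch_B"]        # "arch_B" is a placeholder
training_sets = ["ETH-X-Gaze", "dataset_B"]   # "dataset_B" is a placeholder
reported = {("PureGaze", "ETH-X-Gaze"), ("arch_B", "dataset_B")}
design = crossed_design(architectures, training_sets, reported)
```

When only the diagonal of such a grid is filled, differences between rows cannot be separated from differences between columns, which is the attribution problem the referee raises.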
Circularity Check
No circularity: empirical benchmarking with independent external models and datasets
The paper introduces Gaze4HRI as a new evaluation benchmark and reports performance of pre-existing methods (including PureGaze trained on the external ETH-X-Gaze corpus) across HRI conditions. All central claims—universal failure on steeply-downward gaze, relative resilience of one configuration, and the inference that data diversity outweighs architectural complexity—are direct summaries of observed metrics on held-out test videos. No equations, fitted parameters, or self-referential definitions appear; the derivation chain consists of standard cross-dataset evaluation steps that do not reduce to quantities defined inside the paper itself. Self-citations, if present, are not load-bearing for the reported results.
Appendix A. Ground-Truth Validation
The accuracy of our motion-capture system (OptiTrack) has been validated by tracking the UR5 end-effect...