Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

Ali Görkem Küçük; Berk Sezer; Erol Şahin; Sinan Kalkan

arXiv: 2605.04770 · v1 · submitted 2026-05-06 · cs.CV · cs.HC · cs.LG · cs.RO

Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3

classification: cs.CV · cs.HC · cs.LG · cs.RO

keywords: gaze estimation · human-robot interaction · zero-shot robustness · benchmark dataset · data diversity · appearance-based methods · HRI conditions

The pith

All tested gaze estimation networks fail in at least one human-robot interaction condition, with data diversity proving the main source of robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Gaze4HRI dataset to test appearance-based gaze estimators under realistic human-robot conditions that existing benchmarks ignore. It shows every current method breaks down somewhere in the new test set, with steeply downward gaze directions creating a failure shared by all approaches. One exception stands out: the PureGaze model trained on the broad ETH-X-Gaze collection handles every other condition without collapse. The authors conclude that the variety of training images matters more for reliable zero-shot performance than the addition of transformers, temporal modeling, or other architectural refinements. This gives practitioners a concrete way to pick models likely to work when robots must track eyes during movement and changing light.

Core claim

Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement.

What carries the argument

The Gaze4HRI dataset and evaluation protocol, which systematically varies illumination, head-gaze conflicts, and the motion of both camera and gaze target across thousands of video sequences to measure zero-shot robustness.
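The excerpt that accompanies Figure 4 on this page mentions a subject-level mean angular error obtained by averaging each subject's video-level means. The following is a minimal sketch of that two-level aggregation, not the authors' evaluation code. It assumes gaze is represented as 3D vectors and that angular error is the angle between the predicted and ground-truth vectors, which is a common convention; the function names and the input layout are hypothetical.

```python
import math
from statistics import mean


def angular_error_deg(pred, gt):
    """Angle in degrees between two 3D gaze vectors (need not be unit length)."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    if norm == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp so rounding error cannot push the cosine outside acos's domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def subject_level_mean(videos):
    """Average a subject's video-level mean errors.

    videos: {video_id: [(pred_vector, gt_vector), ...]} for ONE subject
    (hypothetical layout, chosen only for this illustration).
    """
    video_means = [
        mean(angular_error_deg(p, g) for p, g in frames)
        for frames in videos.values()
    ]
    return mean(video_means)


# One subject, two videos of unequal length.
example = {
    "v1": [((1, 0, 0), (1, 0, 0)), ((1, 0, 0), (0, 1, 0))],  # errors 0 and 90
    "v2": [((1, 0, 0), (0, 1, 0))],                           # error 90
}
subject_error = subject_level_mean(example)
```

Averaging per video first gives each video equal weight regardless of its length. In the toy data the subject-level mean is the average of 45 and 90, whereas pooling all three frames would give a different value.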

If this is right

Steeply downward gaze directions require targeted attention in future model design or data collection for HRI applications.
Training on large and varied gaze datasets delivers broader robustness than adding spatial-temporal layers or transformer blocks.
Self-adversarial training losses that purify gaze features can deliver further gains once a diverse base dataset is used.
Practitioners should prioritize models trained on diverse collections such as ETH-X-Gaze when selecting estimators for moving-camera robot settings.
Future benchmarks must include dynamic camera and target motion to avoid overestimating real-world performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If data diversity drives robustness, then deliberately expanding existing gaze collections with more extreme angles and lighting could remove the remaining universal failure mode.
The same benchmarking approach could be applied to other robotic vision tasks to check whether architectural complexity is similarly overvalued relative to data coverage.
Deployments involving downward camera angles relative to people would need either special handling or fallback strategies until the downward-gaze problem is solved.
The universal failure on steep downward looks may point to a shared limitation in how current networks process eye appearance when the iris and pupil are partially occluded by the eyelid.

Load-bearing premise

The specific variables varied in Gaze4HRI capture the most important real-world difficulties that gaze estimators will face when deployed on robots.

What would settle it

A new method that succeeds on steeply downward gazes within the Gaze4HRI videos while using a simple non-transformer architecture and a smaller training set would disprove the claim that data diversity is the dominant factor.

Figures

Figures reproduced from arXiv: 2605.04770 by Ali Görkem Küçük, Berk Sezer, Erol Şahin, Sinan Kalkan.

**Figure 1.** (a) We introduce Gaze4HRI, an extensive benchmark which includes high-quality gaze recordings for 50+ subjects, …

**Figure 2.** Experimental Setup. The user is instructed to look …

**Figure 3.** Eye-head calibration for gaze ground truth collection: …

**Figure 4.** The four setups used in our analysis. … the subject in this experiment. For each method and level, we compute subject-level mean angular error by averaging a subject’s video-level means, as shown in Table III. Results: Method rankings by illumination. The subject-level mean angular errors are provided in Table III. Pairwise within-subject t-tests (Holm-corrected, α = 0.05) show that PureGaze (E) and GazeTR…

**Figure 5.** Gaze targets on the shared table, for the object …

**Figure 6.** Exp. 1: Illustration of different illumination levels.

**Figure 7.** Experiment 3: Samples for low (a) and high (b) levels …
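The Figure 4 excerpt refers to pairwise within-subject t-tests that are Holm-corrected at α = 0.05. As a hedged illustration of what that correction does, the sketch below applies the Holm step-down procedure to a list of p-values that are assumed to have been computed already. The function name and the example p-values are made up for illustration and are not results from the paper.

```python
def holm_adjust(p_values, alpha=0.05):
    """Holm step-down correction for multiple comparisons.

    Returns (adjusted_p, reject), both in the same order as the input.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is multiplied by the number of
        # hypotheses not yet processed, then capped at 1.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Three hypothetical pairwise comparisons between methods.
adjusted, reject = holm_adjust([0.01, 0.04, 0.03])
```

With these made-up inputs only the smallest p-value survives the correction, which shows why a difference that looks significant in isolation may not count once every pair of methods is compared.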

Original abstract

While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at https://gazeforhri.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Gaze4HRI supplies a practical new benchmark for HRI gaze estimation but its claim that data diversity beats architecture rests on a confounded comparison.

The paper's main contribution is a new dataset of over 3,000 videos from 50+ subjects that explicitly varies camera motion, target motion, illumination, and head-gaze conflicts in video sequences. They run several existing gaze estimators zero-shot on it and report that every method fails in at least one condition, with steeply downward gaze being the consistent weak point. PureGaze trained on ETH-X-Gaze is the only one that holds up across the rest of the conditions. This gives robotics people a more relevant testbed than the usual static or lab datasets, and the failure patterns are worth knowing for anyone deploying gaze in moving robot scenarios. The dataset and code release is the part that actually adds value here.

The central claim that extensive data diversity is the primary driver of robustness, and that this should steer people away from complex spatial-temporal or Transformer models, does not hold up cleanly. Only PureGaze is described as coming from the diverse ETH-X-Gaze training set; the other methods appear to be evaluated from their original, narrower training distributions. That mixes training data effects with architecture differences, so you cannot attribute the performance gap to data volume alone. A cleaner test would retrain the competing models on the same large set or hold training data fixed while varying architecture.

The evaluation protocol itself looks straightforward from the abstract, but without seeing the exact splits and metric definitions it is hard to judge whether any post-selection happened.

The work is aimed at HRI practitioners who need to pick or adapt gaze models for real robot settings, and at researchers who want a harder zero-shot test than current benchmarks provide. It is worth sending to peer review so the dataset details and the training-data controls can be checked and the claims tightened if needed.
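The confound described in the letter can be made concrete: data and architecture effects can only be separated if each architecture is evaluated with each training set. The sketch below lists which cells of that crossed design are missing from a set of evaluated runs. The labels `ArchA`, `ArchB`, `DataX`, and `DataY` are placeholders and do not describe the configurations actually evaluated in the paper.

```python
from itertools import product


def missing_cells(runs):
    """List (architecture, training_set) pairs absent from a full crossing.

    runs: iterable of (architecture, training_set) pairs that were evaluated.
    An empty result means every architecture was paired with every training set.
    """
    runs = set(runs)
    architectures = sorted({arch for arch, _ in runs})
    training_sets = sorted({data for _, data in runs})
    return [
        cell for cell in product(architectures, training_sets)
        if cell not in runs
    ]


# Each architecture appears with only one training set: fully confounded.
confounded = missing_cells([("ArchA", "DataX"), ("ArchB", "DataY")])

# Every architecture trained on every set: effects can be separated.
crossed = missing_cells(
    [("ArchA", "DataX"), ("ArchA", "DataY"),
     ("ArchB", "DataX"), ("ArchB", "DataY")]
)
```

In the confounded case the off-diagonal cells are missing, so a performance gap between the two runs cannot be assigned to either factor. Filling those cells corresponds to the retraining the letter asks for.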

Referee Report

1 major / 1 minor

Summary. The paper introduces Gaze4HRI, a new large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) for zero-shot benchmarking of appearance-based 3D gaze estimation methods under HRI-relevant conditions including illumination variation, head-gaze conflict, and motion of camera and gaze targets in video. Evaluation of multiple state-of-the-art methods shows that all fail in at least one condition, with steeply downward gaze as a universal failure mode; PureGaze trained on the ETH-X-Gaze dataset is the only one resilient across other conditions. The authors conclude that training data diversity is the primary driver of zero-shot robustness, challenging the emphasis on complex spatial-temporal or Transformer architectures.

Significance. If the empirical findings hold after addressing confounds, the work provides a useful new benchmark focused on practical HRI deployment gaps and offers actionable guidance that data curation may yield higher returns than architectural elaboration for generalization. Public release of the dataset and code supports reproducibility and follow-on work in robotics.

major comments (1)

[Abstract and Results/Discussion] The claim in the abstract that 'extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments' (and the related challenge to Transformer-based architectures) is not load-bearing without controlling for training data. The results highlight PureGaze trained on ETH-X-Gaze as uniquely resilient, but other evaluated methods are not reported with equivalent training on the same diverse corpus; this confounds data volume/diversity with model architecture and self-adversarial loss, preventing unambiguous attribution. A rephrasing of the conclusion or an ablation (e.g., reporting original training sets or cross-training) is needed in the Results and Discussion sections.

minor comments (1)

[Dataset description] Clarify in the dataset section the precise distribution of frames across each HRI variable (illumination, head-gaze conflict, motions) and any post-collection filtering criteria to allow readers to assess coverage and potential selection effects.
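The per-variable frame distribution requested above could be tabulated with a few lines of code once per-video metadata is available. The sketch below assumes a hypothetical metadata layout, one dictionary per video with a frame count and condition labels. The field names and values are invented and do not reflect the released Gaze4HRI format.

```python
from collections import Counter


def frames_per_level(videos, variable):
    """Sum frame counts for each level of one condition variable.

    videos: iterable of dicts, each with a 'frames' count and condition labels
    (hypothetical layout, chosen only for this illustration).
    """
    totals = Counter()
    for video in videos:
        totals[video[variable]] += video["frames"]
    return dict(totals)


# Invented metadata records for three videos.
metadata = [
    {"frames": 200, "illumination": "low", "camera_motion": "static"},
    {"frames": 150, "illumination": "high", "camera_motion": "moving"},
    {"frames": 250, "illumination": "low", "camera_motion": "moving"},
]
by_illumination = frames_per_level(metadata, "illumination")
by_camera = frames_per_level(metadata, "camera_motion")
```

Reporting one such table per variable, before and after any filtering step, would let readers see whether some conditions are thinly covered.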

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and describe the revisions we will implement.

Referee: [Abstract and Results/Discussion] The claim in the abstract that 'extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments' (and the related challenge to Transformer-based architectures) is not load-bearing without controlling for training data. The results highlight PureGaze trained on ETH-X-Gaze as uniquely resilient, but other evaluated methods are not reported with equivalent training on the same diverse corpus; this confounds data volume/diversity with model architecture and self-adversarial loss, preventing unambiguous attribution. A rephrasing of the conclusion or an ablation (e.g., reporting original training sets or cross-training) is needed in the Results and Discussion sections.

Authors: We agree that the current presentation risks confounding training data characteristics with model architecture. All methods were evaluated using their publicly released pre-trained weights (or standard training protocols from their original papers), which is the conventional zero-shot setup. ETH-X-Gaze is the largest and most diverse corpus among those represented, and PureGaze trained on it is the only model that remains robust outside the downward-gaze failure mode shared by all methods. This pattern is consistent with our broader claim that data diversity matters, yet we acknowledge the referee's point that a controlled cross-training study would be required for unambiguous causal attribution. We will therefore revise the abstract and the Results/Discussion sections to rephrase the conclusion more cautiously: the findings will be presented as an empirical observation that 'suggests training-data diversity plays a central role' rather than asserting it as the 'primary driver' without qualification. We will also add an explicit limitations paragraph noting the absence of cross-training ablations and the consequent need for future work to disentangle data and architecture effects. This revision addresses the concern without requiring new experiments beyond the scope of the current benchmark.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with independent external models and datasets

The paper introduces Gaze4HRI as a new evaluation benchmark and reports performance of pre-existing methods (including PureGaze trained on the external ETH-X-Gaze corpus) across HRI conditions. All central claims—universal failure on steeply-downward gaze, relative resilience of one configuration, and the inference that data diversity outweighs architectural complexity—are direct summaries of observed metrics on held-out test videos. No equations, fitted parameters, or self-referential definitions appear; the derivation chain consists of standard cross-dataset evaluation steps that do not reduce to quantities defined inside the paper itself. Self-citations, if present, are not load-bearing for the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical benchmarking study. It introduces no new mathematical axioms, free parameters, or invented physical entities; all claims rest on standard computer-vision evaluation practices and the representativeness of the collected videos.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1] A. A. Abdelrahman, T. Hempel, A. Khalifa, A. Al-Hamadi, and L. Dinges. L2cs-net: Fine-grained gaze estimation in unconstrained environments. In 2023 8th International Conference on Frontiers of Signal Processing (ICFSP), pages 98–102, 2023.

[2] H. Admoni and B. Scassellati. Social eye gaze in human-robot interaction: A review. Journal of Human-Robot Interaction, 6:25–63, 03 2017.

[3] H. Balim, S. Park, X. Wang, X. Zhang, and O. Hilliges. Efe: End-to-end frame-to-gaze estimation. pages 2688–2697, 06 2023.

[4] M. Cai, F. Lu, and Y. Sato. Generalizing hand segmentation in egocentric videos with uncertainty-guided model adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14392–14401, 2020.

[5] Y. Cheng, Y. Bao, and F. Lu. Puregaze: Purifying gaze feature for generalizable gaze estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 2022.

[6] Y. Cheng and F. Lu. Gaze estimation using transformer. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 3341–3347, 2022.

[7] Y. Cheng, H. Wang, Y. Bao, and F. Lu. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Trans. Pattern Anal. Mach. Intell., 46(12):7509–7528, Dec. 2024.

[8] S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian. Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12455–12464, 2020.

[9] T. Fischer, H. J. Chang, and Y. Demiris. Rt-gene: Real-time eye gaze estimation in natural environments. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision – ECCV 2018, pages 339–357, Cham, 2018. Springer International Publishing.

[10] A. Fischer-Janzen, M. Zhang, N. Lan, M. R. Yuce, M. Abdel-Malek, N. V. Thakor, W. D. Hairston, J. S. Bayouth, and H. Huang. A scoping review of gaze and eye tracking–based control for assistive robotics. Frontiers in Robotics and AI, 10:1302450, 2024.
[11] K. A. Funes Mora, F. Monay, and J.-M. Odobez. Eyediap: a database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In Proceedings of the Symposium on Eye Tracking Research and Applications, ETRA ’14, page 255–258, New York, NY, USA, 2014. Association for Computing Machinery.

[12] R. Ghosh, A. Dutta, and J. Matas. Automatic gaze analysis: A survey of deep learning based approaches. Computer Vision and Image Understanding, 214:103313, 2022.

[13] Y. Guan, Z. Chen, W. Zeng, Z.-G. Cao, and Y. Xiao. End-to-end video gaze estimation via capturing head-face-eye spatial-temporal interaction context. IEEE Signal Processing Letters, PP:1–5, 01 2023.

[14] Z. Guo, Z. Yuan, C. Zhang, W. Chi, Y. Ling, and S. Zhang. Domain adaptation gaze estimation by embedding with prediction consistency. In H. Ishikawa, C.-L. Liu, T. Pajdla, and J. Shi, editors, Computer Vision, pages 292–307, Cham, 2021. Springer International Publishing.

[15] L. Jianfeng and L. Shigang. Eye-model-based gaze estimation by rgb-d camera. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 606–610, 2014.

[16] M. Jording, A. Hartz, G. Bente, M. Schulte-Rüther, and K. Vogeley. The “social gaze space”: A taxonomy for gaze-based communication in triadic interactions. Frontiers in Psychology, 09:226, 02 2018.

[17] P. Kellnhofer, A. Recasens, S. Stent, W. Matusik, and A. Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6911–6920, 2019.

[18] K. Kompatsiari, V. Tikhanoff, F. Ciardo, G. Metta, and A. Wykowska. The importance of mutual gaze in human-robot interaction. In Social Robotics, pages 443–452. Springer, 2017.

[19] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2176–2184, 2016.

[20] P. Lanillos, J. F. Ferreira, and J. Dias. A bayesian hierarchy for robust gaze estimation in human–robot interaction. International Journal of Approximate Reasoning, 87:1–22, 2017.
[21] L. Lin, Z. Wu, Y. Lu, Z. Chen, and W. Guo. Recent progress on eye-tracking and gaze estimation for ar/vr applications: A review. Electronics, 14(17):3352, 2025.

[22] Y. Liu, R. Liu, H. Wang, and F. Lu. Generalizing gaze estimation with outlier-guided collaborative adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3835–3844, 2021.

[23] NaturalPoint, Inc. OptiTrack camera comparison and technical specifications, 2024.

[24] S. Panagou, W. P. Neumann, and F. Fruggiero. A scoping review of human robot interaction research towards industry 5.0 human-centric workplaces. International Journal of Production Research, 62(3):974–990, 2024.

[25] S. Pathirana, B. Shrestha, and J. Lee. Eye gaze estimation: A survey on deep learning-based approaches. IEEE Access, 10:99741–99764, 2022.

[26] D. Qi, W. Tan, Q. Yao, and J. Liu. Yolo5face: Why reinventing a face detector, 05 2021.

[27] T. Schreiter, T. Rodrigues de Almeida, Y. Zhu, E. Gutiérrez Maestro, L. Morillo-Méndez, A. Rudenko, L. Palmieri, T. Kucner, M. Magnusson, and A. Lilienthal. THÖR-MAGNI: A large-scale indoor motion capture recording of human movement and robot interaction. The International Journal of Robotics Research, 44, 10 2024.

[28] T. Schreiter, A. Rudenko, M. Magnusson, and A. J. Lilienthal. Human gaze and head rotation during navigation, exploration and object manipulation in shared environments with robots. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN), pages 1258–1265, 2024.

[29] P. K. Sharma and P. Chakraborty. A review of driver gaze estimation and application in gaze behavior understanding. Engineering Applications of Artificial Intelligence, 133:108117, 2024.

[30] Y. Sugano, Y. Matsushita, and Y. Sato. Learning-by-synthesis for appearance-based 3D gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1821–1828, 2014.

[31] C. Thepsoonthorn, K.-i. Ogawa, and Y. Miyake. The relationship between robot’s nonverbal behaviour and human’s likability based on human’s personality. Scientific Reports, 8, 05 2018.
[32] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176, 2017.

[33] Universal Robots. UR5 Technical Specifications, 2024.

[34] P. Vuillecard and J.-M. Odobez. Enhancing 3d gaze estimation in the wild using weak supervision with gaze following labels. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13508–13518, 2025.

[35] X. Wang, J. Zhang, H. Zhang, S. Zhao, and H. Liu. Vision-based gaze estimation: A review. IEEE Transactions on Cognitive and Developmental Systems, 14(2):316–332, 2022.

[36] X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges. Eth-xgaze: A large-scale dataset for gaze estimation under extreme head pose and gaze variation. In Proc. European Conference on Computer Vision (ECCV), 2020.

[37] X. Zhang, Y. Sugano, and A. Bulling. Revisiting data normalization for appearance-based gaze estimation. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, ETRA ’18, pages 1–9, New York, NY, USA, 2018. Association for Computing Machinery.

[38] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 41, pages 162–175, 2019.

[39] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang. Bytetrack: Multi-object tracking by associating every detection box. In European Conference on Computer Vision (ECCV). Springer, 2022.

APPENDIX A. Ground-Truth Validation. The accuracy of our motion-capture system (OptiTrack) has been validated by tracking the UR5 end-effect...
