TinyGaze: Lightweight Gaze-Gesture Recognition on Commodity Mobile Devices
Pith reviewed 2026-05-14 22:01 UTC · model grok-4.3
The pith
A compact 46k-parameter model recognizes gaze gestures on mobile devices with a Macro F1 of 0.960 in a four-participant pilot, using ARKit head and eye data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that a compact time-series model, TinyHAR (46k parameters), attains Macro F1 scores of 0.960 for 5-way gaze gesture recognition and 0.997 for 4-way user identification when trained on ARKit head and eye transforms collected under a scaffolded guidance-to-recall protocol. It matches or exceeds deeper baselines while relying primarily on head-pose dynamics.
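Both headline numbers are Macro F1 scores, which are the unweighted mean of the per-class F1 values. Each of the five gestures (or four users) therefore counts equally, regardless of how many trials it contributes. The following is a minimal pure-Python sketch of that definition. It is not the authors' evaluation code. The integer labels in the usage lines are placeholders, because the five gestures are not listed on this page.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
             labels: Optional[List[Hashable]] = None) -> float:
    """Unweighted mean of per-class F1, where F1 = 2TP / (2TP + FP + FN)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the trial was not p
            fn[t] += 1  # true class t was missed
    f1s = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Convention: a class with no true and no predicted trials scores 0.
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Ten hypothetical trials over placeholder gesture labels 0..4;
# one trial of gesture 4 is misclassified as gesture 3.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 0, 1, 1, 2, 2, 3, 3, 4, 3]
score = macro_f1(y_true, y_pred)  # (1 + 1 + 1 + 0.8 + 2/3) / 5, about 0.893
```

A single confusion between two classes lowers both of their F1 scores. Near the ceiling, a score of 0.960 therefore corresponds to only a handful of misclassified trials.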
What carries the argument
TinyHAR, a compact time-series model that processes sequences of head and eye transforms from ARKit to classify gaze gestures and identify users.
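In ARKit, `ARFaceAnchor` exposes the head pose through its `transform` and the eye poses through `leftEyeTransform` and `rightEyeTransform`. All three are 4x4 homogeneous matrices. The paper's exact feature layout is not given here. The sketch below therefore assumes a hypothetical 12-channel frame: head translation and head z axis, followed by the z axis of each eye. It also shows how those channels could be grouped into head and eye subsets for a modality comparison. Matrices are written row-major (`m[row][col]`), so the translation sits in column 3 and the local z axis in column 2. The names `CHANNEL_GROUPS`, `frame_features` and `select_modality` are illustrative only.

```python
from typing import Dict, List, Sequence

# Row-major 4x4 homogeneous transform: m[row][col]; translation in column 3.
Matrix4 = Sequence[Sequence[float]]

# Hypothetical channel layout (an assumption, not the paper's):
# 0-2 head translation, 3-5 head z axis, 6-8 left-eye z axis, 9-11 right-eye z axis.
CHANNEL_GROUPS: Dict[str, List[int]] = {
    "head": [0, 1, 2, 3, 4, 5],
    "eye": [6, 7, 8, 9, 10, 11],
}


def translation(m: Matrix4) -> List[float]:
    """Position component of a homogeneous transform."""
    return [m[0][3], m[1][3], m[2][3]]


def z_axis(m: Matrix4) -> List[float]:
    """Local z axis of the transform (third column of the rotation block)."""
    return [m[0][2], m[1][2], m[2][2]]


def frame_features(head: Matrix4, left_eye: Matrix4, right_eye: Matrix4) -> List[float]:
    """Flatten one frame of head/eye transforms into 12 channels."""
    return translation(head) + z_axis(head) + z_axis(left_eye) + z_axis(right_eye)


def select_modality(sequence: List[List[float]], modality: str) -> List[List[float]]:
    """Keep only 'head' or 'eye' channels, or 'both' for the full input."""
    if modality == "both":
        idx = CHANNEL_GROUPS["head"] + CHANNEL_GROUPS["eye"]
    else:
        idx = CHANNEL_GROUPS[modality]
    return [[frame[i] for i in idx] for frame in sequence]
```

Under this layout, a head-only versus eye-only comparison means training the same model three times: on `select_modality(seq, "head")`, on `select_modality(seq, "eye")`, and on the full sequence.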
Load-bearing premise
The accuracy measured in one controlled lab session with four participants will hold for diverse users and everyday mobile environments.
What would settle it
A follow-up experiment with twenty or more participants across multiple sessions in uncontrolled settings such as walking or varying lighting, measuring whether Macro F1 falls below 0.85 on the same gestures.
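With only a few hundred trials, a point estimate alone cannot show whether Macro F1 really falls below 0.85. One way to make the criterion concrete is a percentile bootstrap over trials, in which the result counts as below the threshold only when the whole confidence interval lies under 0.85. This is an assumption of this sketch; the page does not propose it. The names `bootstrap_ci` and `clearly_below` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0) -> Tuple[float, float]:
    """Percentile bootstrap CI for Macro F1, resampling trials with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def clearly_below(y_true, y_pred, threshold=0.85, **kwargs) -> bool:
    """True only if the entire bootstrap CI lies under the threshold."""
    _, upper = bootstrap_ci(y_true, y_pred, **kwargs)
    return upper < threshold
```

Resampling individual trials treats them as independent. When each participant contributes many trials, resampling whole participants (a cluster bootstrap) would give a wider and more conservative interval.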
Figures
Original abstract
Gaze gestures can provide hands-free input on mobile devices, but practical use requires (i) gestures users can learn and recall and (ii) recognition models that are efficient enough for on-device deployment. We present an end-to-end pipeline using commodity ARKit head/eye transforms and a scaffolded guidance-to-recall protocol grounded in learning theory. In a pilot feasibility study (N=4 participants; 240 trials; controlled single-session setting), we benchmark a compact time-series model (TinyHAR) against deeper baselines (DeepConvLSTM, SA-HAR) on 5-way gesture recognition and 4-way user identification. TinyHAR achieves strong performance in this pilot benchmark (Macro F1 = 0.960 for gesture recognition; Macro F1 = 0.997 for user identification) while using only 46k parameters. A modality analysis further indicates that head pose dynamics are highly informative for mobile gaze gestures, highlighting embodied head-eye coordination as a key design consideration. Although the small sample size and controlled setting limit generalizability, these results indicate a potential direction for further investigation into on-device gaze gesture recognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents TinyGaze, an end-to-end pipeline for gaze-gesture recognition on commodity mobile devices that combines ARKit head/eye transforms with a scaffolded guidance-to-recall protocol. In a pilot feasibility study (N=4 participants, 240 trials, single controlled session), a compact time-series model (TinyHAR, 46k parameters) is benchmarked against DeepConvLSTM and SA-HAR on 5-way gesture recognition and 4-way user identification, reporting Macro F1 scores of 0.960 and 0.997 respectively. A modality analysis highlights the informativeness of head-pose dynamics for these gestures.
Significance. If the pilot performance generalizes, the work would demonstrate that very small models can support accurate on-device gaze gestures, lowering barriers to hands-free mobile input and emphasizing embodied head-eye coordination as a design factor. The current evidence, however, is confined to a narrow controlled setting, so the significance remains prospective pending larger-scale validation.
major comments (2)
- [Pilot feasibility study] Pilot study description: the Macro F1 scores of 0.960 (gesture) and 0.997 (user ID) are obtained from ~60 trials per participant in a single session with no reported cross-subject validation, error bars, or statistical significance tests; this leaves open the possibility that the model fits participant-specific idiosyncrasies rather than transferable signals, directly affecting the claim that the results indicate a viable direction for commodity deployment.
- [Modality analysis] Modality analysis: the statement that head-pose dynamics are 'highly informative' is presented without accompanying ablation results, feature-importance metrics, or quantitative comparison of head-pose-only versus eye-only inputs, making it impossible to assess how much of the reported performance depends on this modality.
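The cross-subject validation requested in the first comment can be organised as leave-one-participant-out (LOPO) folds. Each participant is held out once, and the model is trained on the remaining participants. The sketch below generates those folds and reports the mean and sample standard deviation across folds, which would supply the missing error bars. The callback `evaluate(train_idx, test_idx)` is a hypothetical stand-in for training and scoring TinyHAR or a baseline. LOPO fits only the gesture task. For 4-way user identification the held-out user's class never appears in training, so that task needs within-participant splits, for example by trial.

```python
from statistics import mean, stdev
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def leave_one_participant_out(
    participants: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_id, train_indices, test_indices), one fold per participant."""
    for held_out in sorted(set(participants)):
        train = [i for i, p in enumerate(participants) if p != held_out]
        test = [i for i, p in enumerate(participants) if p == held_out]
        yield held_out, train, test


def lopo_summary(
    participants: Sequence[str],
    evaluate: Callable[[List[int], List[int]], float],
) -> Tuple[float, float, Dict[str, float]]:
    """Score every fold; return (mean, sample std, per-fold scores)."""
    per_fold = {
        pid: evaluate(train, test)
        for pid, train, test in leave_one_participant_out(participants)
    }
    scores = list(per_fold.values())
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread, per_fold
```

With N=4 participants there are only four folds. The standard deviation is therefore itself a rough estimate, and the per-fold scores are worth reporting alongside it.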
minor comments (2)
- [Abstract] Clarify the exact relationship between the system name TinyGaze and the model name TinyHAR; the abstract uses both without explicit mapping.
- [Methods] Provide the precise list of the five gestures and the details of the scaffolded guidance-to-recall protocol so that the study can be replicated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our pilot feasibility study. We address each major comment below, clarifying the scope of our claims and outlining revisions to strengthen the manuscript.
read point-by-point responses
- Referee: [Pilot feasibility study] Pilot study description: the Macro F1 scores of 0.960 (gesture) and 0.997 (user ID) are obtained from ~60 trials per participant in a single session with no reported cross-subject validation, error bars, or statistical significance tests; this leaves open the possibility that the model fits participant-specific idiosyncrasies rather than transferable signals, directly affecting the claim that the results indicate a viable direction for commodity deployment.
Authors: We agree that the current presentation lacks cross-subject validation, error bars, and statistical tests, which is a valid concern for assessing transferability in a small-N pilot. The manuscript already frames the work as a controlled feasibility study with explicit limitations on generalizability. In the revision we will add leave-one-participant-out cross-validation, report standard deviations across folds, and include statistical comparisons (e.g., McNemar tests or paired Wilcoxon tests) between TinyHAR and the baselines to better demonstrate whether performance reflects transferable signals rather than idiosyncrasies. revision: yes
- Referee: [Modality analysis] Modality analysis: the statement that head-pose dynamics are 'highly informative' is presented without accompanying ablation results, feature-importance metrics, or quantitative comparison of head-pose-only versus eye-only inputs, making it impossible to assess how much of the reported performance depends on this modality.
Authors: We acknowledge that the modality analysis section currently lacks explicit ablation studies or quantitative comparisons. In the revised manuscript we will expand this section to report Macro F1 scores for head-pose-only, eye-only, and combined inputs, along with a simple feature-importance ranking derived from the time-series model, to provide direct quantitative support for the informativeness of head-pose dynamics. revision: yes
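The McNemar test mentioned in the rebuttal compares two classifiers scored on the same trials. It uses only the discordant trials: b, where model A is right and model B is wrong, and c, the reverse. Under the null hypothesis, b follows Binomial(b + c, 0.5). With Macro F1 near 0.96, the number of discordant trials is likely to be small, so the exact binomial form is safer than the chi-square approximation. The following is a minimal stdlib sketch; the function names are illustrative.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(y_true: Sequence[int], pred_a: Sequence[int],
                      pred_b: Sequence[int]) -> Tuple[int, int]:
    """b = A correct and B wrong; c = A wrong and B correct."""
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value: 2 * P(X <= min(b, c)), X ~ Bin(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree; no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Pooling trials from all participants into one test treats them as independent. Running the test per LOPO fold, or pairing it with the per-fold comparison proposed in the rebuttal, partly addresses that.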
Circularity Check
No circularity: empirical pilot benchmark with direct measurements
full rationale
The paper reports results from a controlled pilot study (N=4, 240 trials) using a compact time-series model (TinyHAR) on ARKit head/eye data. No mathematical derivation chain, equations, or first-principles predictions exist. Reported Macro F1 scores (0.960 gesture, 0.997 user ID) are direct empirical measurements on the collected data, not quantities fitted to a subset and then renamed as predictions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is self-contained as an empirical feasibility benchmark.
Axiom & Free-Parameter Ledger
free parameters (1)
- TinyHAR model weights
axioms (1)
- Domain assumption: ARKit head and eye transforms provide reliable input for gaze gesture recognition.
Reference graph
Works this paper leans on
- [1] Hirotaka Aoki, John Paulin Hansen, and Kenji Itoh. 2008. Learning to interact with a computer by gaze. Behaviour & Information Technology 27, 4 (2008), 339–344.
- [2] Heiko Drewes and Albrecht Schmidt. 2007. Interacting with the computer using gaze gestures. In IFIP Conference on Human-Computer Interaction. Springer, 475–488.
- [3] Carlos Elmadjian and Carlos H Morimoto. 2021. GazeBar: Exploiting the Midas touch in gaze interaction. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 1–7.
- [4] Kenko Fujii, Gauthier Gras, Antonino Salerno, and Guang-Zhong Yang. 2018. Gaze gesture based human robot interaction for laparoscopic surgery. Medical Image Analysis 44 (2018), 196–214.
- [5]
- [6] Zhiming Hu, Daniel Haeufle, Syn Schmitt, and Andreas Bulling. 2025. HOIGaze: Gaze estimation during hand-object interactions in extended reality exploiting eye-hand-head coordination. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 1–10.
- [7] Robert JK Jacob. 1991. The use of eye movements in human-computer interaction techniques: what you look at is what you get. ACM Transactions on Information Systems (TOIS) 9, 2 (1991), 152–169.
- [8] Christina Katsini, Yasmeen Abdrabou, George E Raptis, Mohamed Khamis, and Florian Alt. 2020. The role of eye gaze in security and privacy applications: Survey and future HCI research directions. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–21.
- [9] Mohamed Khamis, Florian Alt, and Andreas Bulling. 2018. The past, present, and future of gaze-enabled handheld mobile devices: Survey and lessons learned. In Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services. 1–17.
- [10] Andy Kong, Karan Ahuja, Mayank Goel, and Chris Harrison. 2021. EyeMU interactions: Gaze + IMU gestures on mobile devices. In Proceedings of the 2021 International Conference on Multimodal Interaction. 577–585.
- [11] Yaxiong Lei, Xinya Gong, Shijing He, Yafei Wang, Mohamed Khamis, and Juan Ye. 2026. The People’s Gaze: Co-Designing and Refining Gaze Gestures with Users and Experts. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems.
- [12] Yaxiong Lei, Shijing He, Huining Feng, Kaixing Zhao, Mohamed Khamis, and Juan Ye. 2023. Protecting Privacy in an Era of Pervasive Camera-Based Devices: Challenges and Potential Directions. In Proceedings of the Fifth UK Mobile, Wearable and Ubiquitous Systems Research Symposium.
- [13] Yaxiong Lei, Shijing He, Mohamed Khamis, and Juan Ye. 2023. An end-to-end review of gaze estimation and its interactive applications on handheld mobile devices. Comput. Surveys 56, 2 (2023), 1–38.
- [14]
- [15] Yaxiong Lei, Yuheng Wang, Tyler Caslin, Alexander Wisowaty, Xu Zhu, Mohamed Khamis, and Juan Ye. 2023. DynamicRead: Exploring robust gaze interaction methods for reading on handheld mobile devices under dynamic conditions. Proceedings of the ACM on Human-Computer Interaction 7, ETRA (2023), 1–17.
- [16]
- [17] Saif Mahmud, M Tanjid Hasan Tonmoy, Kishor Kumar Bhaumik, AKM Mahbubur Rahman, M Ashraful Amin, Mohammad Shoyaib, Muhammad Asif Hossain Khan, and Amin Ahsan Ali. 2020. Human activity recognition from wearable sensor data using self-attention. In ECAI 2020. IOS Press, 1332–1339.
- [18] Pallavi Mohan, Wooi Boon Goh, Chi-Wing Fu, and Sai-Kit Yeung. 2018. DualGaze: Addressing the Midas touch problem in gaze mediated VR interaction. In 2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). IEEE, 79–84.
- [19] Cristina Palmero, Javier Selva, Mohammad Ali Bagheri, and Sergio Escalera. 2018. Recurrent CNN for 3D gaze estimation using appearance and shape cues. arXiv preprint arXiv:1805.03064 (2018).
- [20] T Maxwell Parker, Shervin Badihian, Ahmed Hassoon, Ali S Saber Tehrani, Nathan Farrell, David E Newman-Toker, and Jorge Otero-Millan. 2022. Eye and head movement recordings using smartphones for telemedicine applications: measurements of accuracy and precision. Frontiers in Neurology 13 (2022), 789581.
- [21] Henry L Roediger and Andrew C Butler. 2011. The critical role of retrieval practice in long-term retention. Trends in Cognitive Sciences 15, 1 (2011), 20–27.
- [22] Lei Shi, Cosmin Copot, and Steve Vanlanduit. 2021. Gaze gesture recognition by graph convolutional networks. Frontiers in Robotics and AI 8 (2021).
- [23] Nachiappan Valliappan, Na Dai, Ethan Steinberg, Junfeng He, Kantwon Rogers, Venky Ramachandran, Pingmei Xu, Mina Shojaeizadeh, Li Guo, Kai Kohlhoff, et al. 2020. Accelerating eye movement research via accurate and affordable smartphone eye tracking. Nature Communications 11, 1 (2020).
- [24] Janneke Van de Pol, Monique Volman, and Jos Beishuizen. 2010. Scaffolding in teacher–student interaction: A decade of research. Educational Psychology Review 22, 3 (2010), 271–296.
- [25] Renzhuo Wan, Shuping Mei, Jun Wang, Min Liu, and Fan Yang. 2019. Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting. Electronics 8, 8 (2019), 876.
- [26] Jacob O Wobbrock, Andrew D Wilson, and Yang Li. 2007. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. In Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology. 159–168.
- [27] Shumin Zhai. 2003. What’s in the eyes for attentive input. Commun. ACM 46, 3 (2003), 34–39.
- [28] Wenhao Zhang, Melvyn L Smith, Lyndon N Smith, and Abdul Farooq. 2016. Gender and gaze gesture recognition for human-computer interaction. Computer Vision and Image Understanding 149 (2016), 32–50.
- [29] Yexu Zhou, Haibin Zhao, Yiran Huang, Till Riedel, Michael Hefenbrock, and Michael Beigl. 2022. TinyHAR: A lightweight deep learning model designed for human activity recognition. In Proceedings of the 2022 ACM International Symposium on Wearable Computers. 89–93.

Paper running header: CHI EA ’26, April 13–17, 2026, Barcelona, Spain · Yaxiong Lei et al.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.