Low Latency Gaze Tracking via Latent Optical Sensing
Pith reviewed 2026-05-20 12:26 UTC · model grok-4.3
The pith
A passive optical encoder with microlens array and binary mask captures compact light measurements that a neural network maps directly to gaze direction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a fully passive optical encoder, built from a microlens array and co-designed binary chromium mask, produces a compact set of spatially multiplexed measurements that contain sufficient information for a lightweight neural network to recover gaze direction accurately. By moving feature extraction into the optical domain before any digital readout, the prototype achieves an end-to-end sensing-to-inference latency of 3.4 ms and competitive estimation accuracy without forming or processing full-resolution images.
What carries the argument
The central mechanism is the passive optical encoder formed by a microlens array and co-designed binary chromium mask that performs spatially multiplexed encoding of light into a compact measurement vector captured by a 4x4 phototransistor array.
If this is right
- High-bandwidth image readout and subsequent digital feature extraction are no longer required for real-time gaze tracking.
- End-to-end latency drops to 3.4 ms, which is lower than previously reported research systems.
- Energy consumption decreases because only a small set of measurements is digitized and processed.
- The same optical-encoding principle can support other low-latency human-computer interaction tasks that rely on directional inference.
Where Pith is reading between the lines
- Similar optical pre-processing could be applied to other vision tasks such as hand or object tracking to reduce latency in wearable devices.
- Pairing the encoder with different sensor arrays might allow operation under wider lighting ranges without increasing power draw.
- The approach suggests a path toward embedding gaze tracking directly into everyday surfaces or displays rather than dedicated camera modules.
Load-bearing premise
The compact optical measurements produced by the microlens array and binary mask contain enough information for the neural network to recover accurate gaze direction without access to full-resolution images.
What would settle it
Running the prototype on real-world data with changing head poses and lighting while measuring both latency above 5 ms and gaze error larger than published camera-based systems would falsify the central claim.
Figures
read the original abstract
We present a real-time gaze tracking system that directly acquires task-relevant latent features using a fully passive optical encoder. Instead of forming and processing full-resolution images, our approach leverages a microlens array with a co-designed binary chromium mask to perform spatially multiplexed optical encoding, producing a compact set of measurements sufficient for gaze estimation. By integrating sensing and feature extraction in the optical domain, the proposed system eliminates the need for high-bandwidth image readout and substantially reduces computational overhead. The encoded measurements are captured by a 4 x 4 phototransistor array and mapped to gaze direction using a lightweight neural network. Our proof-of-concept prototype enables an end-to-end sensing-to-inference latency of 3.4 ms, outperforming published research systems. We demonstrate the effectiveness of our approach on both simulated and real-world data, achieving competitive gaze estimation accuracy while significantly improving latency and energy efficiency compared to conventional camera-based pipelines. This work highlights the potential of task-driven optical sensing for ultra-low-latency, computationally efficient human-computer interaction systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a real-time gaze tracking system using a passive optical encoder consisting of a microlens array and co-designed binary chromium mask to perform spatially multiplexed encoding. The resulting compact measurements are captured by a 4x4 phototransistor array and mapped to gaze direction via a lightweight neural network, eliminating high-bandwidth image readout. The proof-of-concept prototype is reported to achieve 3.4 ms end-to-end sensing-to-inference latency while delivering competitive gaze estimation accuracy on both simulated and real-world data, with advantages in latency and energy efficiency over conventional camera-based pipelines.
Significance. If the central performance claims are substantiated, the work could meaningfully advance ultra-low-latency HCI by demonstrating task-driven optical sensing that integrates feature extraction at the hardware level. The co-design of the optical mask and neural network for gaze-specific measurements offers a concrete example of reducing computational overhead in real-time vision systems.
major comments (2)
- [Results] Results section: the manuscript claims competitive gaze estimation accuracy and 3.4 ms latency but provides no quantitative error metrics (e.g., angular error in degrees), confidence intervals, or details on training/validation splits and cross-validation procedures, preventing verification that the reported numbers support the central claims.
- [Prototype and Evaluation] Prototype and Evaluation sections: the central assumption that the 16 scalar measurements from the 4x4 phototransistor array contain sufficient information for accurate gaze regression depends on the specific optical encoding; the manuscript does not report ablation studies or robustness tests under head motion, illumination changes, or inter-subject eye variation that would confirm the many-to-one mapping can be inverted without loss of pupil or corneal-reflection cues.
minor comments (2)
- [Abstract] Abstract: the statement that the system 'outperforms published research systems' should include explicit latency and accuracy numbers from the referenced works for direct comparison.
- [Methods] Methods: the architecture, layer sizes, and training hyperparameters of the lightweight neural network should be specified, along with the exact optical simulation parameters used for the mask and microlens array.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to clarify the quantitative aspects of our results and to strengthen the evaluation of the optical encoding approach. We will revise the manuscript to address these points directly.
read point-by-point responses
-
Referee: [Results] Results section: the manuscript claims competitive gaze estimation accuracy and 3.4 ms latency but provides no quantitative error metrics (e.g., angular error in degrees), confidence intervals, or details on training/validation splits and cross-validation procedures, preventing verification that the reported numbers support the central claims.
Authors: We agree that the Results section would benefit from more explicit quantitative reporting. In the revised manuscript we will add mean angular error (in degrees) together with standard deviation and 95% confidence intervals for both simulated and real-world experiments. We will also expand the Evaluation section to describe the data partitioning (subject-independent 70/20/10 train/validation/test split) and the 5-fold cross-validation procedure used to assess generalization. These additions will make the performance claims verifiable without altering the core experimental outcomes. revision: yes
-
Referee: [Prototype and Evaluation] Prototype and Evaluation sections: the central assumption that the 16 scalar measurements from the 4x4 phototransistor array contain sufficient information for accurate gaze regression depends on the specific optical encoding; the manuscript does not report ablation studies or robustness tests under head motion, illumination changes, or inter-subject eye variation that would confirm the many-to-one mapping can be inverted without loss of pupil or corneal-reflection cues.
Authors: We acknowledge that additional ablation and robustness analyses would strengthen the central claim. In the revised manuscript we will include an ablation study that compares performance with and without the co-designed binary mask, as well as tests under controlled head motion (up to several centimeters), varying illumination levels, and data collected from multiple subjects. These experiments will demonstrate that the optically encoded measurements remain informative for gaze regression even when traditional pupil or corneal-reflection cues are not explicitly recovered. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical prototype description: a passive optical encoder (microlens array + co-designed binary mask) produces 16 scalar measurements from a 4x4 phototransistor array, which are then fed to a lightweight neural network for gaze regression. No equations, fitted parameters, or self-citations are shown that reduce the claimed 3.4 ms end-to-end latency or competitive accuracy back to the same measurements by construction. The performance numbers are reported as direct experimental outcomes from the built system on simulated and real-world data, with no load-bearing self-referential steps or uniqueness theorems imported from prior author work. The central claim therefore remains independent of its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
FirstName LastName , title =
-
[2]
FirstName Alpher , title =
-
[3]
Journal of Foo , volume = 13, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
-
[4]
Journal of Foo , volume = 14, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
-
[5]
FirstName Alpher and FirstName Gamow , title =
-
[6]
Optical Gaze Tracking with Spatially-Sparse Single-Pixel Detectors , year=
Li, Richard and Whitmire, Eric and Stengel, Michael and Boudaoud, Ben and Kautz, Jan and Luebke, David and Patel, Shwetak and Akşit, Kaan , booktitle=. Optical Gaze Tracking with Spatially-Sparse Single-Pixel Detectors , year=
-
[7]
Li, Tianxing and Liu, Qiang and Zhou, Xia , title =. 2017 , isbn =. doi:10.1145/3131672.3131682 , booktitle =
-
[8]
Sen, Argha and Bandara, Nuwan Sriyantha and Gokarn, Ila and Kandappu, Thivya and Misra, Archan , title =. 2024 , issue_date =. doi:10.1145/3699745 , journal =
-
[9]
Zhao, Guangrong and Yang, Yurun and Liu, Jingwei and Chen, Ning and Shen, Yiran and Wen, Hongkai and Lan, Guohao , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =
work page 2023
-
[10]
Kim, Joohwan and Stengel, Michael and Majercik, Alexander and De Mello, Shalini and Dunn, David and Laine, Samuli and McGuire, Morgan and Luebke, David , title =. 2019 , isbn =. doi:10.1145/3290605.3300780 , booktitle =
-
[11]
Angelopoulos, Anastasios N. and Martel, Julien N.P. and Kohli, Amit P. and Conradt, Jörg and Wetzstein, Gordon , journal=. Event-Based Near-Eye Gaze Tracking Beyond 10,000 Hz , year=
-
[12]
Proceedings of the AAAI Conference on Artificial Intelligence , author=
PureGaze: Purifying Gaze Feature for Generalizable Gaze Estimation , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2022 , month=. doi:10.1609/aaai.v36i1.19921 , number=
-
[13]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Wang, Yaoming and Jiang, Yangzhou and Li, Jin and Ni, Bingbing and Dai, Wenrui and Li, Chenglin and Xiong, Hongkai and Li, Teng , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2022 , pages =
work page 2022
-
[14]
Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=
Mtgls: Multi-task gaze estimation with limited supervision , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=
-
[15]
A High-Frame-Rate Eye-Tracking Framework for Mobile Devices , year=
Chang, Yuhu and He, Changyang and Zhao, Yingying and Lu, Tun and Gu, Ning , booktitle=. A High-Frame-Rate Eye-Tracking Framework for Mobile Devices , year=
-
[16]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
Cross-encoder for unsupervised gaze representation learning , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
-
[17]
Proceedings of the asian conference on computer vision , pages=
Latentgaze: Cross-domain gaze estimation through gaze-aware analytic latent code manipulation , author=. Proceedings of the asian conference on computer vision , pages=
-
[18]
Klotz, Jeremy and Nayar, Shree K. , title =. 2024 , isbn =. doi:10.1007/978-3-031-73039-9_19 , booktitle =
-
[19]
arXiv preprint arXiv:2412.09774 , year =
A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization , author =. arXiv preprint arXiv:2412.09774 , year =
-
[20]
Tolerance-Aware Deep Optics , author =. arXiv preprint arXiv:2502.04719 , year =
- [21]
-
[22]
Light: Science & Applications , volume=
LOEN: Lensless opto-electronic neural network empowered machine vision , author=. Light: Science & Applications , volume=. 2022 , publisher=
work page 2022
-
[23]
Atanov, Andrei and Fu, Jiawei and Singh, Rishubh and Yu, Isabella and Spielberg, Andrew and Zamir, Amir , title =. Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29 – October 4, 2024, Proceedings, Part LXXIV , pages =. 2024 , isbn =. doi:10.1007/978-3-031-72904-1_27 , abstract =
-
[24]
Image sensing with multilayer nonlinear optical neural networks , author=. Nature Photonics , volume=. 2023 , publisher=
work page 2023
-
[25]
Nature Reviews Physics , volume=
Non-line-of-sight imaging , author=. Nature Reviews Physics , volume=. 2020 , publisher=
work page 2020
-
[26]
Task-driven lens design , volume =
Xinge Yang and Qiang Fu and Yunfeng Nie and Wolfgang Heidrich , journal =. Task-driven lens design , volume =. 2026 , url =. doi:10.1364/OE.588912 , abstract =
-
[27]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
-
[28]
ACM Transactions on Graphics (TOG) , volume=
Collaborative On-Sensor Array Cameras , author=. ACM Transactions on Graphics (TOG) , volume=. 2025 , publisher=
work page 2025
-
[29]
Xucong Zhang and Seonwook Park and Thabo Beeler and Derek Bradley and Siyu Tang and Otmar Hilliges , title =. 2020 , booktitle =
work page 2020
-
[30]
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation , author =. 2019 , journal =. doi:10.1109/TPAMI.2017.2778103 , pages =
-
[31]
Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas , title =. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year =. doi:10.1109/CVPR.2015.7299081 , video =
-
[32]
Smith, Brian A. and Yin, Qi and Feiner, Steven K. and Nayar, Shree K. , title =. 2013 , isbn =. doi:10.1145/2501988.2501994 , booktitle =
-
[33]
In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications (ETRA ’18)
Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas , title =. 2018 , isbn =. doi:10.1145/3204493.3204548 , booktitle =
-
[34]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[35]
Hadi Amata and Qiang Fu and Wolfgang Heidrich , booktitle =. Comparative Performance Analysis of Multi-level Diffractive Lens and Lens Fabricated by Grayscale Lithography and Soft-imprinting , year =. Optica Imaging Congress 2024 (3D, AOMS, COSI, ISA, pcAOP) , keywords =. doi:10.1364/ISA.2024.ITh4D.1 , abstract =
-
[36]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
The unreasonable effectiveness of deep features as a perceptual metric , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[37]
Niklas Stein and Diederick C. Niehorster and Tamara Watson and Frank Steinicke and Katharina Rifai and Siegfried Wahl and Markus Lappe , title =. i-Perception , volume =. 2021 , doi =
work page 2021
-
[38]
EyeLink 1000 Plus [Apparatus and software] , author =. 2013 , address =
work page 2013
-
[39]
arXiv preprint arXiv:2510.01213 , year=
JaneEye: A 0.5 ms Latency Eye Tracking ASIC for XR Applications , author=. arXiv preprint arXiv:2510.01213 , year=
-
[40]
E-Gaze: Gaze Estimation with Event Camera , author=. IEEE TPAMI , year=
-
[41]
GazeCapsNet: A lightweight gaze Estimation framework , author=. Sensors , volume=. 2025 , publisher=
work page 2025
-
[42]
Bose, Laurie and Chen, Jianing and Carey, Stephen J. and Dudek, Piotr , booktitle=. Pixel Processor Arrays For Low Latency Gaze Estimation , year=
-
[43]
TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for Gaze Estimation , year=
Bonazzi, Pietro and Rüegg, Thomas and Bian, Sizhen and Li, Yawei and Magno, Michele , booktitle=. TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for Gaze Estimation , year=
-
[44]
Chen, Ning and Shen, Yiran and Zhang, Tongyu and Yang, Yanni and Wen, Hongkai , journal=. EX-Gaze: High-Frequency and Low-Latency Gaze Tracking with Hybrid Event-Frame Cameras for On-Device Extended Reality , year=
-
[45]
Bonazzi, Pietro and Bian, Sizhen and Lippolis, Giovanni and Li, Yawei and Sheik, Sadique and Magno, Michele , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , month =. 2024 , pages =
work page 2024
-
[46]
arXiv preprint arXiv:2508.19544 , year=
WEBEYETRACK: Scalable Eye-Tracking for the Browser via On-Device Few-Shot Personalization , author=. arXiv preprint arXiv:2508.19544 , year=
-
[47]
Applications of Digital Image Processing XLI , editor =
Injoon Hong and Kyeongryeol Bong and Hoi-Jun Yoo , title =. Applications of Digital Image Processing XLI , editor =. 2018 , doi =
work page 2018
-
[48]
Coded aperture snapshot spectral imaging fundus camera , author=. Scientific Reports , volume=. 2023 , publisher=
work page 2023
-
[49]
IEEE Transactions on Computational Imaging , volume=
Flatcam: Thin bare-sensor cameras using coded aperture and computation , author=. IEEE Transactions on Computational Imaging , volume=
-
[50]
IEEE signal processing magazine , volume=
Single-pixel imaging via compressive sampling , author=. IEEE signal processing magazine , volume=. 2008 , publisher=
work page 2008
-
[51]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
Gaze360: Physically unconstrained gaze estimation in the wild , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
-
[52]
Twenty years of eye typing: systems and design issues , year =
Majaranta, P\". Twenty years of eye typing: systems and design issues , year =. Proceedings of the 2002 Symposium on Eye Tracking Research & Applications , pages =. doi:10.1145/507072.507076 , abstract =
-
[53]
Guenter, Brian and Finch, Mark and Drucker, Steven and Tan, Desney and Snyder, John , title =. ACM Trans. Graph. , month = nov, articleno =. 2012 , issue_date =. doi:10.1145/2366145.2366183 , abstract =
-
[54]
Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration , volume=
Eye tracking communication devices in amyotrophic lateral sclerosis: impact on disability and quality of life , author=. Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration , volume=. 2013 , publisher=
work page 2013
-
[55]
Proceedings of the SIGCHI conference on Human Factors in Computing Systems , pages=
Interacting with eye movements in virtual environments , author=. Proceedings of the SIGCHI conference on Human Factors in Computing Systems , pages=
-
[56]
Eye-tracking in interactive virtual environments: implementation and evaluation , author=. Applied Sciences , volume=. 2022 , publisher=
work page 2022
-
[57]
Computers & education , volume=
A review study on eye-tracking technology usage in immersive virtual reality learning environments , author=. Computers & education , volume=. 2023 , publisher=
work page 2023
-
[58]
Proceedings of the 26th annual ACM symposium on User interface software and technology , pages=
Gaze locking: passive eye contact detection for human-object interaction , author=. Proceedings of the 26th annual ACM symposium on User interface software and technology , pages=
-
[59]
Proceedings of the 23rd ACM international conference on Multimedia , pages=
An affordable solution for binocular eye tracking and calibration in head-mounted displays , author=. Proceedings of the 23rd ACM international conference on Multimedia , pages=
-
[60]
Proceedings of the ACM on computer graphics and interactive techniques , volume=
Using deep learning to increase eye-tracking robustness, accuracy, and precision in virtual reality , author=. Proceedings of the ACM on computer graphics and interactive techniques , volume=. 2024 , publisher=
work page 2024
-
[61]
2023 8th International Conference on Frontiers of Signal Processing (ICFSP) , pages=
L2cs-net: Fine-grained gaze estimation in unconstrained environments , author=. 2023 8th International Conference on Frontiers of Signal Processing (ICFSP) , pages=. 2023 , organization=
work page 2023
-
[62]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
- [63]
-
[64]
Investigating Bias and Fairness in Appearance-based Gaze Estimation
Investigating Bias and Fairness in Appearance-based Gaze Estimation , author =. 2026 , eprint =. doi:10.48550/arXiv.2604.10707 , url =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.10707 2026
-
[65]
International conference on machine learning , pages=
Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=
work page 2021
-
[66]
A Computational Approach to Edge Detection , year=
Canny, John , journal=. A Computational Approach to Edge Detection , year=
-
[67]
arXiv preprint arXiv:2211.11936 , year=
One eye is all you need: Lightweight ensembles for gaze estimation with single encoders , author=. arXiv preprint arXiv:2211.11936 , year=
-
[68]
European conference on computer vision , pages=
Towards end-to-end video-based eye-tracking , author=. European conference on computer vision , pages=. 2020 , organization=
work page 2020
-
[69]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Puregaze: Purifying gaze feature for generalizable gaze estimation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[70]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Analyzing and improving the image quality of stylegan , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[71]
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Richardson, Elad and Alaluf, Yuval and Patashnik, Or and Nitzan, Yotam and Azar, Yaniv and Shapiro, Stav and Cohen-Or, Daniel , title =. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
-
[72]
International conference on machine learning , pages=
Efficientnet: Rethinking model scaling for convolutional neural networks , author=. International conference on machine learning , pages=. 2019 , organization=
work page 2019
-
[73]
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Mobilenets: Efficient convolutional neural networks for mobile vision applications , author=. arXiv preprint arXiv:1704.04861 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[74]
Privacy and Identity Management
What Does Your Gaze Reveal About You? On the Privacy Implications of Eye Tracking , author=. Privacy and Identity Management. Data for Better Living: AI and Privacy , pages=. 2020 , publisher=
work page 2020
-
[75]
Privacy Considerations for a Pervasive Eye Tracking World , author=. Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication , pages=. 2014 , publisher=
work page 2014
-
[76]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Improving few-shot user-specific gaze adaptation via gaze redirection synthesis , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.