
arxiv: 2605.09181 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.ET · eess.IV

Establishing Robust Retinal Eye Tracking: A Weakly Supervised Algorithmic Framework

Pith reviewed 2026-05-12 02:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.ET · eess.IV
keywords retinal eye tracking · weakly supervised learning · gaze estimation · template matching · ophthalmic imaging · eye tracking robustness

The pith

A weakly-supervised learning framework delivers robust retinal eye tracking with gaze error below 0.45 degrees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retinal image-based eye tracking offers higher precision than standard pupil and cornea methods used in AR/VR devices, yet existing algorithms depend on classical template matching that breaks down with feature changes and real imaging conditions. The paper presents a new learning-based approach trained under weak supervision to register and track retinal features more reliably. Early tests across six participants reach a 95th-percentile gaze error under 0.45 degrees. If the method holds, it opens practical use of retinal tracking for ophthalmic imaging and higher-accuracy gaze systems. The design avoids the need for dense manual annotations during training.

Core claim

The authors propose a novel weakly-supervised, learning-based framework for robust retinal eye tracking that improves upon classical template-matching registration by handling retinal feature variability and real-world imaging conditions, demonstrated through initial studies achieving a 95th-percentile gaze error below 0.45 degrees across six participants.

What carries the argument

The weakly-supervised learning-based framework that learns to register retinal images for eye position without full supervision or dense labels.

If this is right

  • Retinal eye tracking becomes reliable enough for routine use in ophthalmic imaging systems.
  • AR/VR devices can achieve higher gaze accuracy by switching to retinal methods.
  • Training eye trackers requires far less labeled data than fully supervised alternatives.
  • Tracking stability improves across variable retinal features and capture conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same weak-supervision strategy might transfer to other medical image registration tasks with limited annotations.
  • Integration with hardware sensors in consumer devices would test whether the accuracy persists outside controlled studies.
  • Larger-scale validation on diverse age groups and eye pathologies would clarify the method's practical limits.

Load-bearing premise

The accuracy measured in the small group of six participants will hold for larger populations and under the full range of real-world retinal imaging variations.

What would settle it

Applying the trained framework to a new cohort of participants or under previously untested lighting and eye conditions and measuring 95th-percentile gaze error above 0.45 degrees.
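
The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the per-sample error is assumed to be the Euclidean distance between predicted and ground-truth (pitch, yaw) in degrees, which the abstract does not specify. The percentile uses linear interpolation between order statistics, which is one common convention among several.

```python
import math
from typing import List, Sequence, Tuple

Gaze = Tuple[float, float]  # (pitch, yaw) in degrees


def angular_errors(pred: Sequence[Gaze], truth: Sequence[Gaze]) -> List[float]:
    """Per-sample gaze error, ASSUMED here to be Euclidean distance in (pitch, yaw)."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    return [math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred, truth)]


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0..100), linearly interpolated between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def exceeds_threshold(
    pred: Sequence[Gaze], truth: Sequence[Gaze], threshold_deg: float = 0.45
) -> bool:
    """True if the 95th-percentile gaze error is above the threshold (in degrees)."""
    return percentile(angular_errors(pred, truth), 95.0) > threshold_deg
```

Running `exceeds_threshold` on predictions from a held-out cohort would return `True` exactly when the reported bound fails on that cohort under these assumptions.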

Original abstract

Retinal image-based eye tracking is widely used in ophthalmic imaging and vision science, and is a promising path to deliver higher gaze accuracy than the pupil- and cornea-based approaches commonly used in modern AR/VR devices. Nevertheless, existing retinal tracking algorithms still primarily rely on classical template-matching registration, which can be insufficiently robust to retinal feature variability and real-world imaging conditions. In this work, we propose a novel weakly-supervised, learning-based framework for robust retinal eye tracking. Initial studies demonstrate high accuracy, achieving the 95th-percentile gaze error < 0.45 deg across a cohort of 6 participants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a novel weakly-supervised learning-based framework for retinal eye tracking intended to improve robustness over classical template-matching registration methods under retinal feature variability and real-world imaging conditions. It reports initial results achieving a 95th-percentile gaze error below 0.45 degrees on a cohort of 6 participants.

Significance. If the framework can be shown to generalize reliably, it would offer a useful advance for high-accuracy gaze estimation in ophthalmic imaging and AR/VR systems. The weakly-supervised design could lower annotation costs in medical imaging domains. At present, however, the extremely limited evaluation prevents any confident judgment of practical significance or robustness.

major comments (1)
  1. Abstract: the central accuracy and robustness claims rest on quantitative results from only 6 participants with no reported details on validation protocol, participant diversity, data splits, baselines, error bars, or external test sets. This sample size is insufficient to support generalization to the real-world variability highlighted in the introduction as the failure mode of prior methods.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the abstract and evaluation details below, and will revise the manuscript to provide greater clarity while appropriately scoping our claims.

Point-by-point responses
  1. Referee: Abstract: the central accuracy and robustness claims rest on quantitative results from only 6 participants with no reported details on validation protocol, participant diversity, data splits, baselines, error bars, or external test sets. This sample size is insufficient to support generalization to the real-world variability highlighted in the introduction as the failure mode of prior methods.

    Authors: We agree that the abstract requires additional detail on the evaluation. In the revision we will expand it to specify the validation protocol (participant-wise cross-validation on the 6-person cohort), participant characteristics, data splits, direct quantitative comparison against classical template-matching baselines, and error statistics with appropriate context. We will also ensure error bars or intervals appear in the results. However, we will revise the abstract and introduction language to present these as initial feasibility results on a small cohort rather than evidence of broad generalization or robustness to all real-world variability. Larger-scale validation with external test sets remains future work.

    Revision: partial
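
One common way to implement participant-wise cross-validation is leave-one-participant-out splitting, sketched below. The simulated rebuttal does not say which variant is meant, so the scheme and the helper name `leave_one_participant_out` are assumptions for illustration only.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_participant_out(
    participant_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_id, train_indices, test_indices), one fold per participant.

    participant_ids[i] is the participant who produced sample i, so no
    participant ever contributes samples to both sides of a fold.
    """
    unique_ids = sorted(set(participant_ids))
    if len(unique_ids) < 2:
        raise ValueError("need at least two participants")
    for held_out in unique_ids:
        train = [i for i, pid in enumerate(participant_ids) if pid != held_out]
        test = [i for i, pid in enumerate(participant_ids) if pid == held_out]
        yield held_out, train, test
```

With a 6-person cohort this gives six folds, each testing on one unseen participant. That measures cross-subject generalization within the cohort but does not replace an external test set.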

standing simulated objections not resolved
  • The evaluation remains limited to a 6-participant cohort, which we cannot expand in the current revision and which inherently restricts strong claims of generalization to real-world retinal feature variability.

Circularity Check

0 steps flagged

No circularity detected; empirical evaluation on small cohort with no self-referential derivation chain

The manuscript proposes a weakly-supervised learning framework for retinal eye tracking and supports its accuracy claims solely through empirical testing on a cohort of 6 participants. No mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or described structure that would reduce any result to its own inputs by construction. The central contribution is an algorithmic framework whose performance is reported experimentally rather than derived in a closed loop, rendering the work self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or assumptions; ledger is empty by default.

pith-pipeline@v0.9.0 · 5437 in / 1000 out tokens · 29922 ms · 2026-05-12T02:32:43.922197+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    Establishing Robust Retinal Eye Tracking: A Weakly Supervised Algorithmic Framework

    INTRODUCTION Retinal image-based eye tracking has the potential to deliver substantially higher gaze accuracy than traditional pupil- or cornea-based approaches. This is because it measures gaze more directly, by observing where light falls on the retina—particularly relative to the fovea, which defines the center of vision. The core idea is that each gaz...

  2. [2]

    RELATED WORK 2.1. Retinal Image-based Eye Tracking With the advent of scanning laser ophthalmoscopy (SLO) and adaptive optics SLO (AOSLO), strip-based cross-correlation has become the primary algorithmic paradigm for retinal eye tracking. In this technique, narrow image strips are cross-correlated against a reference retinal image to estimate eye mot...

  3. [3]

    Overview The core principle of retinal eye tracking is that gaze direction can be inferred from how retinal features shift in the captured image as the eye rotates

    PROPOSED METHOD 3.1. Overview The core principle of retinal eye tracking is that gaze direction can be inferred from how retinal features shift in the captured image as the eye rotates. Specifically, the gaze (pitch, yaw) is related to the translation of a source retinal image relative to a reference (foveal) retinal image acquired when the user looks...

  4. [4]

    We then fine-tune the descriptor decoder using a triplet loss: $L_{\text{desc}} = \sum_{i \in K} \max\left(0,\; m + \phi_{\text{pos}} - \tfrac{1}{2}\left(\phi_{\text{neg-rand}} + \phi_{\text{neg-hard}}\right)\right)$ (2). We refer readers to [8] for details of this loss

    and freeze the shared encoder and detector decoder. We then fine-tune the descriptor decoder using a triplet loss: $L_{\text{desc}} = \sum_{i \in K} \max\left(0,\; m + \phi_{\text{pos}} - \tfrac{1}{2}\left(\phi_{\text{neg-rand}} + \phi_{\text{neg-hard}}\right)\right)$ (2). We refer readers to [8] for details of this loss. Furthermore, we propose a keypoint-preserving and boosting loss: L_kp = max(0, h − [ Σ_{i∈P} σ( Di enhanced −γ t ) − stopgrad( Σ_{i∈P} σ(...

  5. [5]

    Dataset Experiments were conducted on both phantom-eye and real-eye images over a ±5° gaze range

    EXPERIMENT 4.1. Dataset Experiments were conducted on both phantom-eye and real-eye images over a ±5° gaze range. For the phantom-eye experiments, we used a dataset collected with a custom retinal eye tracking system [10]. Ground truth gaze direction (pitch and yaw) was provided by the motorized goniometer stages holding the phantom eye. The dataset ...

  6. [6]

    CONCLUSION AND FUTURE WORK In this paper, we propose a robust, accurate and practical algorithmic framework for retinal image-based eye tracking. The proposed approach includes multiple methodological contributions, including a task-specialized image registration model and a complementary feature space registration strategy designed to improve robus...

  7. [7]

    Substrip-based registration and automatic montaging of adaptive optics retinal images,

    Ruixue Liu, Xiaolin Wang, Sujin Hoshi, and Yuhua Zhang, “Substrip-based registration and automatic montaging of adaptive optics retinal images,” Biomed. Opt. Express, vol. 15, no. 2, pp. 1311–1330, 2024

  8. [8]

    De-warping of images and improved eye tracking for the scanning laser ophthalmoscope,

    Phillip Bedggood and Andrew Metha, “De-warping of images and improved eye tracking for the scanning laser ophthalmoscope,” PLoS One, 2017

  9. [9]

    Binocular eye tracking with the tracking scanning laser ophthalmoscope,

    Scott Stevenson, Christy Sheehy, and Austin Roorda, “Binocular eye tracking with the tracking scanning laser ophthalmoscope,” Vision Res, vol. 118, pp. 98–104, 2016

  10. [10]

    Active eye-tracking for an adaptive optics scanning laser ophthalmoscope,

    Christy Sheehy, Pavan Tiruveedhula, Ramkumar Sabesan, and Austin Roorda, “Active eye-tracking for an adaptive optics scanning laser ophthalmoscope,” Biomed. Opt. Express, vol. 6, no. 7, pp. 2412–2423, 2015

  11. [11]

    Superjunction: Learning-based junction detection for retinal image registration,

    Wang Yu, Xiaoye Wang, Zaiwang Gu, Weide Liu, Wee Siong Ng, Weimin Huang, and Jun Cheng, “Superjunction: Learning-based junction detection for retinal image registration,” in AAAI Conference on Artificial Intelligence, 2024, pp. 292–300

  12. [12]

    Robust content-adaptive global registration for multimodal retinal images using weakly supervised deep-learning framework,

    Yiqian Wang, Junkang Zhang, Melina Cavichini, Dirk Bartsch, William Freeman, Truong Nguyen, and Cheolhong An, “Robust content-adaptive global registration for multimodal retinal images using weakly supervised deep-learning framework,” IEEE Transactions on Image Processing, vol. 30, pp. 3167–3178, 2021

  13. [13]

    Two-step registration on multi-modal retinal images via deep neural networks,

    Junkang Zhang, Yiqian Wang, Ji Dai, Melina Cavichini, Dirk Bartsch, William Freeman, Truong Nguyen, and Cheolhong An, “Two-step registration on multi-modal retinal images via deep neural networks,” IEEE Transactions on Image Processing, vol. 31, pp. 823–838, 2022

  14. [14]

    Semi-supervised keypoint detector and descriptor for retinal image matching,

    Jiazhen Liu, Xirong Li, Qijie Wei, Jie Xu, and Dayong Ding, “Semi-supervised keypoint detector and descriptor for retinal image matching,” in 2022 European Conference on Computer Vision (ECCV), 2022, pp. 593–609

  15. [15]

    Accurate registration between ultra-wide-field and narrow angle retina images with 3d eyeball shape optimization,

    Junkang Zhang, Bo Wen, Fritz Gerald P. Kalaw, Melina Cavichini, Dirk-Uwe G. Bartsch, William R. Freeman, Truong Q. Nguyen, and Cheolhong An, “Accurate registration between ultra-wide-field and narrow angle retina images with 3d eyeball shape optimization,” in 2023 IEEE International Conference on Image Processing (ICIP), 2023, pp. 2750–2754

  16. [16]

    Gaze-matched, pupil-steered retinal imaging for arcmin precision eye tracking over a 50° gaze range at 200hz,

    Francesco LaRocca, Michael Tilleman, Carmen Wang, Bartlomiej Kowalski, David Li, Youmin Wang, Qiang Yang, Alfredo Dubra, and Mohamed El-Haddad, “Gaze-matched, pupil-steered retinal imaging for arcmin precision eye tracking over a 50° gaze range at 200hz,” in Ophthalmic Technologies XXXV, 2025, p. 15

  17. [17]

    High-speed, image-based eye tracking with a scanning laser ophthalmoscope,

    Christy Sheehy, Qiang Yang, David W. Arathorn, Pavan Tiruveedhula, Johannes F. de Boer, and Austin Roorda, “High-speed, image-based eye tracking with a scanning laser ophthalmoscope,” Biomed. Opt. Express, vol. 3, no. 10, pp. 2611–2622, 2012

  18. [18]

    Retinaregnet: A zero-shot approach for retinal image registration,

    Vishal Balaji Sivaraman, Muhammad Imran, Qingyue Wei, Preethika Muralidharan, Michelle R. Tamplin, Isabella M. Grumbach, Randy H. Kardon, Jui-Kai Wang, Yuyin Zhou, and Wei Shao, “Retinaregnet: A zero-shot approach for retinal image registration,” Computers in Biology and Medicine, vol. 186, pp. 109645, 2025

  19. [19]

    Object recognition from local scale-invariant features,

    David G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1999, pp. 1150–1157

  20. [20]

    Emergent correspondence from image diffusion,

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan, “Emergent correspondence from image diffusion,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

  21. [21]

    Zero-reference deep curve estimation for low-light image enhancement,

    Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong, “Zero-reference deep curve estimation for low-light image enhancement,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1777–1786

  22. [22]

    Superpoint: Self-supervised interest point detection and description,

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2018, pp. 224–236

  23. [23]

    Superglue: Learning feature matching with graph neural networks,

    Sarlin Paul-Edouard, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020, pp. 4938–4947

  24. [24]

    Orb: An efficient alternative to sift or surf,

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradski, “Orb: An efficient alternative to sift or surf,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2564–2571

  25. [25]

    A multiresolution spline with application to image mosaics,

    Peter Burt and Edward Adelson, “A multiresolution spline with application to image mosaics,” ACM Transactions on Graphics, vol. 2, pp. 217–236, 1983

  26. [26]

    Runtime Analysis The inference time of the proposed algorithm is presented in Table 4, which yields an approximate 14.5 FPS

    SUPPLEMENTARY MATERIALS 7.1. Runtime Analysis The inference time of the proposed algorithm is presented in Table 4, which yields an approximate 14.5 FPS. The experiment is run on one NVIDIA RTX 3080 GPU, with test image size 253×207 and batch size of 1. The canonical feature space is constructed once per subject in approximately 3.8 seconds (on the s...
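
The descriptor triplet loss quoted in entry [4] can be written out directly. The sketch below assumes each φ term is a precomputed scalar per keypoint; the excerpt does not define them and defers to the paper's reference [8]. The function name and argument layout are hypothetical, chosen only to mirror the symbols in the quoted formula.

```python
from typing import Sequence


def descriptor_triplet_loss(
    phi_pos: Sequence[float],
    phi_neg_rand: Sequence[float],
    phi_neg_hard: Sequence[float],
    margin: float,
) -> float:
    """Sum over keypoints of max(0, m + phi_pos - 0.5 * (phi_neg_rand + phi_neg_hard)).

    Each sequence holds one ASSUMED precomputed scalar per keypoint i in K.
    """
    if not (len(phi_pos) == len(phi_neg_rand) == len(phi_neg_hard)):
        raise ValueError("all phi sequences must have the same length")
    return sum(
        max(0.0, margin + pos - 0.5 * (neg_rand + neg_hard))
        for pos, neg_rand, neg_hard in zip(phi_pos, phi_neg_rand, phi_neg_hard)
    )
```

A keypoint contributes zero once the average of its two negative terms exceeds its positive term by at least the margin, which is the usual hinge behaviour of a triplet loss.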