
arxiv: 2511.19629 · v2 · submitted 2025-11-24 · 💻 cs.CV

SkillSight: Efficient First-Person Skill Assessment with Gaze

Pith reviewed 2026-05-17 05:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords: egocentric perception, skill assessment, gaze tracking, knowledge distillation, power efficiency, first-person video, computer vision, wearable computing

The pith

SkillSight distills a gaze-only student model from joint video-and-gaze training to assess skill level with high accuracy at far lower power cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that skill level in physical tasks shows up in where a person looks, not only in what the camera records. It builds a teacher model that learns from both egocentric video and gaze, then transfers that knowledge to a student model that needs only gaze input at test time. Experiments across cooking, music, and sports datasets show the student keeps competitive accuracy while using 73x less power than competing methods. A reader would care because this removes a major barrier to always-on skill feedback on battery-limited devices like smart glasses. The approach therefore opens a route to practical, in-the-wild AI coaching without continuous video capture.

Core claim

Skill level is evident not only in how a person performs an activity but also in how they direct their attention. A two-stage teacher-student framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student. The teacher achieves state-of-the-art accuracy on three real-world datasets, and the student maintains high accuracy while eliminating continuous video processing at inference.

What carries the argument

Two-stage distillation pipeline in which a teacher jointly models gaze and video for skill prediction and then transfers knowledge to a gaze-only student model for low-power inference.
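A minimal sketch of the kind of objective the student stage could optimize, assuming (as the referee report on this page summarizes) a task loss combined with a feature-matching term. The function names, the mean-squared-error form of the matching term, and the `match_weight` parameter are illustrative assumptions, not the paper's implementation.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example, computed from raw logits in a stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def feature_mse(student_feat, teacher_feat):
    """Mean squared error between student and teacher feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


def distillation_loss(student_logits, label, student_feat, teacher_feat, match_weight=1.0):
    """Task loss plus weighted feature matching.

    match_weight is a hypothetical hyperparameter; the page does not state
    the values the authors used.
    """
    task = cross_entropy(student_logits, label)
    match = feature_mse(student_feat, teacher_feat)
    return task + match_weight * match
```

In this sketch the teacher features would come from the frozen video-plus-gaze model and the student features from gaze alone, so the matching term is the only path through which video information reaches the student.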

If this is right

  • Skill assessment becomes feasible on always-on wearable devices without draining the battery.
  • The same gaze signal can support real-time coaching feedback during practice sessions.
  • Recordings that contain only eye tracking become sufficient at inference time; training the teacher still requires paired video and gaze.
  • Power savings scale with the duration of the activity, enabling longer monitoring sessions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distillation step may generalize to other egocentric tasks where attention cues matter more than raw pixel content.
  • Combining the gaze-only model with occasional low-frame-rate video checks could further improve robustness without much added cost.
  • If gaze data can be captured on commodity smart glasses, the method could support large-scale studies of skill acquisition in everyday settings.
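The second extension above (gaze-only inference with occasional video checks) can be explored with back-of-envelope arithmetic. The sketch below assumes a simple additive power model in which gaze sensing runs continuously and video sensing runs for a fraction of the time. Every number in the usage note is hypothetical and chosen only for illustration; the page does not report per-sensor power figures.

```python
def average_power_mw(gaze_mw, video_mw, video_duty_cycle):
    """Average draw when gaze runs continuously and video runs part of the time."""
    if not 0.0 <= video_duty_cycle <= 1.0:
        raise ValueError("video_duty_cycle must be in [0, 1]")
    return gaze_mw + video_duty_cycle * video_mw


def savings_factor(gaze_mw, video_mw, video_duty_cycle):
    """Ratio of always-on (gaze + video) power to the duty-cycled hybrid's power."""
    always_on = gaze_mw + video_mw
    return always_on / average_power_mw(gaze_mw, video_mw, video_duty_cycle)
```

With illustrative values of 1 mW for gaze and 72 mW for video, `savings_factor(1.0, 72.0, 0.0)` gives a 73x saving for pure gaze, while a 5% video duty cycle brings the saving down to roughly 16x. Under this model even sparse video checks dominate the power budget, which is the trade-off the extension would need to weigh.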

Load-bearing premise

Gaze patterns by themselves remain informative enough to predict skill level once the student has been distilled from video-plus-gaze training data.

What would settle it

A large accuracy drop when the gaze-only student is tested on a held-out activity set where gaze statistics no longer correlate with expert versus novice performance.
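One way to operationalize this test is to compare the student's accuracy on in-domain activities with its accuracy on the held-out set. The sketch below assumes predictions and labels are available as plain lists; the helper names are hypothetical, and what counts as a "large" drop is left to the reader.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def held_out_drop(in_domain, held_out):
    """In-domain accuracy minus held-out accuracy.

    Each argument is a (predictions, labels) pair. A large positive value
    would count against the load-bearing premise.
    """
    return accuracy(*in_domain) - accuracy(*held_out)
```

For example, `held_out_drop((preds_seen, labels_seen), (preds_unseen, labels_unseen))` returns a value near zero if gaze cues transfer to the new activities.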

Figures

Figures reproduced from arXiv: 2511.19629 by Chi Hsuan Wu, Kristen Grauman, Kumar Ashutosh.

Figure 1: Skill assessment with gaze. Experts and novices exhibit distinct attention behaviors, influencing both how they move their head and eyes and what they see, as illustrated here with clips from an expert (top) and novice (bottom) basketball layup from [31]. The proposed method explores the associations between gaze, action, and expertise to achieve accurate and power-efficient skill assessment, using either…

Figure 2: Left: Overview of SkillSight-Teacher. We incorporate three components that encode action and gaze correlation, attended object sequence, and gaze trajectory for skill assessment. These features are fused by the fusion layer for prediction. Right: Overview of distillation method. SkillSight-Student learns to distill knowledge from the teacher feature [ev, ec, eg] using the distillation token tdis. As guidan…

Figure 3: What does an expert vs. novice tend to see more of? In these distributions, each patch crops the egocentric frame based on the subject’s gaze coordinates. Our representation surfaces interesting patterns, like (left two boxes) how novice pianists fixate on their hands more often than experts do (77% vs. 45%, as quantified with hand detection), or (right two boxes) how bouldering experts exhibit greater gaz…

Figure 4: Qualitative results. Both SkillSight-T and SkillSight-S better predict skill level than prior work. Experts and novices show distinct gaze patterns consistent with Ego-Exo4D [31] expert commentaries, shown for reference but not used by any model. The last example (bottom right) shows a failure case, highlighting the challenge of assessing skill from subtle movements. Blue rays show gaze direction and depth…

Figure 6: Gaze pattern analysis. SkillSight-S reveals distinct gaze patterns between model-predicted experts and novices. … the power consumption of the best baseline, i.e., EgoDistill [80], by 43%. Moreover, SkillSight-S demonstrates competitive performance compared to video-based methods, which are power intensive regardless of the architecture due to the energy cost of sensing and visual feature encoding. Compar…

Figure 7: Distinct gaze pattern analysis. We present more distinct gaze patterns that SkillSight-S reveals between subjects at different skill levels.
Original abstract

Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SkillSight, a two-stage framework for power-efficient first-person skill assessment. A teacher model jointly processes gaze and egocentric video to predict skill level across cooking, music, and sports tasks; this is distilled into a gaze-only student model that operates at inference without video input. The work claims SOTA performance for the teacher and that the student maintains high accuracy while using 73x less power than competing methods, establishing the value of gaze for skill understanding.

Significance. If the central results hold, the approach could enable practical on-device skill assessment for smart glasses by replacing continuous video processing with low-power gaze input. The cross-domain evaluation and distillation strategy provide a concrete path to power reduction while preserving accuracy, with potential impact on egocentric perception systems for real-world skill learning.

major comments (2)
  1. [§4] §4 (Experiments): The claim that the gaze-only student maintains high accuracy after distillation is load-bearing for the 73x power reduction result, yet the section provides no ablation isolating the contribution of gaze versus video features in the teacher or measuring performance drop when video context is removed at inference; without this, it is unclear whether skill cues transfer fully to the student.
  2. [§3.2] §3.2 (Distillation): The distillation loss is described as combining task loss and feature matching, but no analysis shows that this objective forces recovery of video-dependent discrimination cues (e.g., object attention timing) from gaze sequences alone; if the teacher exploits visual content unavailable to the student, the accuracy premise fails.
minor comments (2)
  1. [Abstract] Abstract: Specific numerical values for accuracy, power measurements, and baseline comparisons are missing, weakening the ability to assess the SOTA and 73x claims at a glance.
  2. [Figure 2] Figure 2: The diagram of the teacher-student pipeline would benefit from explicit annotation of the distillation loss terms and temperature parameter to match the text description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The claim that the gaze-only student maintains high accuracy after distillation is load-bearing for the 73x power reduction result, yet the section provides no ablation isolating the contribution of gaze versus video features in the teacher or measuring performance drop when video context is removed at inference; without this, it is unclear whether skill cues transfer fully to the student.

    Authors: We agree that the current experiments would be strengthened by explicit ablations isolating the contribution of gaze versus video features. In the revised manuscript we will add a new ablation subsection in §4 that (i) compares teacher performance with and without video input and (ii) reports the accuracy drop between the joint teacher and the gaze-only student across all three datasets. These results will directly quantify how much skill-relevant information transfers through distillation and will support the 73x power-reduction claim. revision: yes

  2. Referee: [§3.2] §3.2 (Distillation): The distillation loss is described as combining task loss and feature matching, but no analysis shows that this objective forces recovery of video-dependent discrimination cues (e.g., object attention timing) from gaze sequences alone; if the teacher exploits visual content unavailable to the student, the accuracy premise fails.

    Authors: We acknowledge the value of additional analysis showing that the distillation objective recovers video-dependent cues from gaze alone. While the cross-domain results already indicate successful transfer, we will expand §3.2 with a brief discussion of the feature-matching term and add qualitative examples (gaze attention maps aligned with skill-critical events) plus a quantitative cue-recovery metric in the experiments. These additions will clarify how gaze sequences encode the necessary timing and focus information. revision: partial

Circularity Check

0 steps flagged

No circularity: standard teacher-student distillation with empirical validation on external datasets

Full rationale

The paper presents a two-stage pipeline: a teacher model jointly processes gaze and egocentric video to predict skill level, followed by distillation to a gaze-only student. This follows conventional knowledge distillation without reducing predictions to fitted parameters by construction or relying on self-citation chains for core claims. Experiments on three independent datasets (cooking, music, sports) provide external validation. No self-definitional equations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are described. The 73x power reduction is a measured outcome of removing video input at inference, not a definitional tautology. The derivation chain remains self-contained against benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Relies on the domain assumption that gaze encodes skill information and on standard ML training choices whose specific hyperparameters are not detailed in the abstract.

free parameters (1)
  • distillation loss weights and temperature
    Typical hyperparameters in teacher-student training that must be chosen or tuned to achieve the reported accuracy-power trade-off.
axioms (1)
  • domain assumption: Gaze direction and fixation patterns are informative of skill level in physical tasks.
    Central hypothesis stated in the abstract that justifies using gaze as the sole input at inference.
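To show what the temperature entry in the ledger above would control, here is a minimal sketch of the standard temperature-softened distillation term (KL divergence between softened teacher and student distributions, scaled by the squared temperature). This is the generic knowledge-distillation formulation, offered as an assumption; the page does not confirm that SkillSight uses this exact term, and all function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def soft_target_loss(teacher_logits, student_logits, temperature):
    """Temperature-scaled KL between teacher and student predictions.

    The temperature**2 factor keeps gradient magnitudes comparable across
    temperatures in the standard formulation.
    """
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)
```

Higher temperatures flatten both distributions, so the student is pushed to match the teacher's relative ranking of skill levels rather than only its top prediction.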

pith-pipeline@v0.9.0 · 5482 in / 1105 out tokens · 37827 ms · 2026-05-17T05:35:28.560650+00:00 · methodology



Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · 1 internal anchor

  1. [1] Visual strategies of young soccer players during a passing test – a pilot study. Journal of Eye Movement Research, 15(1), 2022.

  2. [2] Yusuke Akamatsu, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. Classification of expert-novice level using eye tracking and motion data via conditional multimodal variational autoencoder. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1360–1364. IEEE, 2021.

  3. [3] Taravat Anvari, Markus Lappe, and Marc H E de Lussanet. Where does gaze lead? integrating gaze and motion for enhanced 3d pose estimation. In 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 76–83, 2025.

  4. [4] Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, and Kristen Grauman. Expertaf: Expert actionable feedback from video. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13582–13594, 2025.

  5. [5] Alpha Yaya Balde, Emmanuel Bergeret, Denis Cajal, and Jean-Pierre Toumazet. Low power environmental image sensors for remote photogrammetry. Sensors, 22(19):7617,

  6. [6] Gedas Bertasius, Hyun Soo Park, Stella X Yu, and Jianbo Shi. Am i a baller? basketball performance assessment from first-person videos. In Proceedings of the IEEE international conference on computer vision, pages 2177–2185,

  7. [7] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, page 4, 2021.

  8. [8] Edoardo Bianchi and Antonio Liotta. Skillformer: Unified multi-view video understanding for proficiency estimation,

  9. [9] Björn Braun, Rayan Armani, Manuel Meier, Max Moebus, and Christian Holz. egoppg: Heart rate estimation from eye-tracking cameras in egocentric systems to benefit downstream vision tasks. arXiv preprint arXiv:2502.20879,

  10. [10] Tad T Brunyé, Trafton Drew, Donald L Weaver, and Joann G Elmore. A review of eye tracking for understanding and improving diagnostic interpretation. Cognitive research: principles and implications, 4(1):7, 2019.

  11. [11] Shyamal Buch, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. Flexible frame selection for efficient video reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29071–29082,

  12. [12] James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, and Serena Yeung-Levy. Video action differencing. arXiv preprint arXiv:2503.07860, 2025.

  13. [13] Michel A. Cara. The effect of practice and musical structure on pianists’ eye-hand span and visual monitoring. Journal of Eye Movement Research, 16(2):1–18, 2023.

  14. [14] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.

  15. [15] Joe Causer, Adam Harvey, Richard Snelgrove, Gary Arsenault, and Oshin Vartanian. Quiet eye training improves surgical performance: A randomized controlled study. Frontiers in Psychology, 5:821, 2014.

  16. [16] Longfei CHEN, Yuichi NAKAMURA, Kazuaki KONDO, Dima DAMEN, and Walterio MAYOL-CUEVAS. Integration of experts’ and beginners’ machine operation experiences to obtain a detailed task model. IEICE TRANSACTIONS on Information, E104-D(1):152–161, 2021.

  17. [17] Dima Damen, Teesid Leelasawassuk, Osian Haines, Andrew Calway, and Walterio W Mayol-Cuevas. You-do, i-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In BMVC, page 3, 2014.

  18. [18] Radosvet Desislavov, Fernando Martínez-Plumed, and José Hernández-Orallo. Trends in ai inference energy consumption: Beyond the performance-vs-parameter laws of deep learning. Sustainable Computing: Informatics and Systems, 38:100857, 2023.

  19. [19] Linfeng Dong, Wei Wang, Yu Qiao, and Xiao Sun. Lucidaction: A hierarchical and multi-model dataset for comprehensive action quality assessment. Advances in Neural Information Processing Systems, 37:96468–96482, 2024.

  20. [20] Hazel Doughty, Walterio Mayol-Cuevas, and Dima Damen. The pros and cons: Rank-aware temporal attention for skill determination in long videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7862–7871, 2019.

  21. [21] Veronique Drai-Zerbib and Emmanuel Baccino. The influence of expertise in music reading on the detection of temporal violations. Visual Cognition, 20(3):267–282, 2012.

  22. [22] Tobias Drey, Pascal Jansen, Fabian Fischbach, Julian Frommel, and Enrico Rukzio. Towards progress assessment for adaptive hints in educational virtual reality games. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–9, New York, NY, USA, 2020. Association for Computing Machinery.

  23. [23] Amr Elkholy, Mohamed E Hussein, Walid Gomaa, Dima Damen, and Emmanuel Saba. Efficient and robust skeleton-based quality assessment and abnormality detection in human action performance. IEEE journal of biomedical and health informatics, 24(1):280–291, 2019.

  24. [24] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020.

  25. [25] Shijia Feng, Michael Wray, and Walterio Mayol-Cuevas. Evostruggle: A dataset capturing the evolution of struggle across activities and skill levels. arXiv preprint arXiv:2510.01362, 2025.

  26. [26] Isabel Funke, Sören Torge Mees, Jürgen Weitz, and Stefanie Speidel. Video-based surgical skill assessment using 3d convolutional neural networks. International Journal of Computer Assisted Radiology and Surgery, 14(7):1217–1225, 2019.

  27. [27] Soline Galuret, Nicolas Vallée, Alexandre Tronchot, Herve Thomazeau, Pierre Jannin, and Arnaud Huaulmé. Gaze behavior is related to objective technical skills assessment during virtual reality simulator-based surgical training: a proof of concept. International Journal of Computer Assisted Radiology and Surgery, 18(9):1697–1705, 2023.

  28. [28] Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, and Lorenzo Torresani. Listen to look: Action recognition by previewing audio. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10457–10467,

  29. [29] Kumie Gedamu, Yanli Ji, Yang Yang, Jie Shao, and Heng Tao Shen. Visual-semantic alignment temporal parsing for action quality assessment. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

  30. [30] Kerstin Gidlöf, Annika Wallin, Richard Dewhurst, and Kenneth Holmqvist. Using eye tracking to trace a cognitive process: Gaze behaviour during decision making in a natural environment. Journal of eye movement research, 6(1), 2013.

  31. [31] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  32. [32] Mary M Hayhoe and Jonathan Samir Matthis. Control of gaze in natural environments: effects of rewards and costs, uncertainty and memory in target selection. Interface focus, 8(4):20180009, 2018.

  33. [33] Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, 2014.

  34. [34] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  35. [35] Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In Proceedings of the European conference on computer vision (ECCV), pages 754–769, 2018.

  36. [36] Yifei Huang, Minjie Cai, Zhenqiang Li, Feng Lu, and Yoichi Sato. Mutual context network for jointly estimating egocentric gaze and action. IEEE Transactions on Image Processing, 29:7795–7806, 2020.

  37. [37] Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, et al. Egoexolearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22072–22086, 2024. 1...

  38. [38] Mina Huh, Zihui Xue, Ujjaini Das, Kumar Ashutosh, Kristen Grauman, and Amy Pavel. Vid2coach: Transforming how-to videos into task assistants. arXiv preprint arXiv:2506.00717, 2025.

  39. [39] Inhyeok Jeong, Kento Nakagawa, Rieko Osu, and Kazuyuki Kanosue. Difference in gaze control ability between low and high skill players of a real-time strategy game in esports. PloS one, 17(3):e0265526, 2022.

  40. [40] Jakob Karolus, Johannes Sylupp, Albrecht Schmidt, and Paweł W Woźniak. Eyepiano: leveraging gaze for reflective piano learning. In Proceedings of the 2023 ACM Designing Interactive Systems Conference, pages 1209–1223, 2023.

  41. [41] Aftab Khan, Sebastian Mellor, Rachel King, Balazs Janko, William Harwin, R Simon Sherratt, Ian Craddock, and Thomas Plötz. Generalized and efficient skill assessment from imu data with applications in gymnastics and medical training. ACM Transactions on Computing for Healthcare, 2(1):1–21, 2020.

  42. [42] Robert Konrad, Nitish Padmanaban, J Gabriel Buckmaster, Kevin C Boyle, and Gordon Wetzstein. Gazegpt: Augmenting human capabilities using gaze-contingent contextual ai for smart eyewear. arXiv preprint arXiv:2401.17217, 2024.

  43. [43] Bruno Korbar, Du Tran, and Lorenzo Torresani. Scsampler: Sampling salient clips from video for efficient action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6232–6242, 2019.

  44. [44] Bolin Lai, Miao Liu, Fiona Ryan, and James M Rehg. In the eye of transformer: Global-local correlation for egocentric gaze estimation. arXiv preprint arXiv:2208.04464, 2022.

  45. [45] Bolin Lai, Fiona Ryan, Wenqi Jia, Miao Liu, and James M Rehg. Listen to look into the future: Audio-visual egocentric gaze anticipation. In European Conference on Computer Vision, pages 192–210. Springer, 2024.

  46. [46] Michael Land, Neil Mennie, and Jennifer Rusted. The roles of vision and eye movements in the control of activities of daily living. Perception, 28(11):1311–1328, 1999.

  47. [47] Chae Young Lee, Maxwell Fite, Tejus Rao, Sara Achour, Zerina Kapetanovic, et al. Hypercam: Low-power on-board computer vision for iot cameras. arXiv preprint arXiv:2501.10547, 2025.

  48. [48] Seungmin Lee and Jongseong An. Gaze control and motor performance in motor expertise studies: Focused review of field application research on perceptual skill training. International Journal of Applied Sports Sciences, 35(1), 2023.

  49. [49] Qing Lei, Huiying Li, Hongbo Zhang, Jixiang Du, and Shangce Gao. Multi-skeleton structures graph convolutional network for action quality assessment in long videos. Applied Intelligence, 53(19):21692–21705, 2023.

  50. [50] Yin Li, Alireza Fathi, and James M Rehg. Learning to predict gaze in egocentric video. In Proceedings of the IEEE international conference on computer vision, pages 3216–3223, 2013.

  51. [51]

    In the eye of the be- holder: Gaze and actions in first person video.IEEE trans- actions on pattern analysis and machine intelligence, 45 (6):6731–6747, 2021

    Yin Li, Miao Liu, and James M Rehg. In the eye of the be- holder: Gaze and actions in first person video.IEEE trans- actions on pattern analysis and machine intelligence, 45 (6):6731–6747, 2021. 1, 2, 4, 5, 6, 7

  52. [52]

    A light weight model for active speaker detection

    Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, and Liangyin Chen. A light weight model for active speaker detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22932–22941, 2023. 2

  53. [53]

    Rica^2: Rubric-informed, calibrated assessment of actions

    Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath GNVV Namburi, and Yin Li. Rica^2: Rubric-informed, calibrated assessment of actions. In Proceedings of the European Conference on Computer Vision (ECCV),

  54. [54]

    Chat2map: Efficient scene mapping from multi-ego conversations

    Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, and Vamsi Krishna Ithapu. Chat2map: Efficient scene mapping from multi-ego conversations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10554–10564, 2023. 2

  55. [55]

    Learning to detect attended objects in cultural sites with gaze signals and weak object supervision

    Michele Mazzamuto*, Francesco Ragusa*, Antonino Furnari*, and Giovanni Maria Farinella*. Learning to detect attended objects in cultural sites with gaze signals and weak object supervision. ACM Journal on Computing and Cultural Heritage, 17(3):1–21, 2024. 2

  56. [56]

    Gazing into missteps: Leveraging eye-gaze for unsupervised mistake detection in egocentric videos of skilled human activities

    Michele Mazzamuto, Antonino Furnari, Yoichi Sato, and Giovanni Maria Farinella. Gazing into missteps: Leveraging eye-gaze for unsupervised mistake detection in egocentric videos of skilled human activities. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8310–8320, 2025. 2

  57. [57]

    See like an expert: Gaze-augmented training enhances skill acquisition in a virtual reality robotic suturing task

    Rachel Melnyk, Timothy Campbell, Tyler Holler, Katherine Cameron, Patrick Saba, Michael W Witthaus, Jean Joseph, and Ahmed Ghazi. See like an expert: Gaze-augmented training enhances skill acquisition in a virtual reality robotic suturing task. Journal of Endourology, 35(3):376–382, 2021. 2

  58. [58]

    Project aria glasses user manual

    Meta Platforms, Inc. Project aria glasses user manual. https://facebookresearch.github.io/projectaria_tools/docs/ARK/glasses_manual/glasses_user_manual, 2025. Accessed: 2025-10-06. 3, 1

  59. [59]

    Integrating human gaze into attention for egocentric activity recognition

    Kyle Min and Jason J Corso. Integrating human gaze into attention for egocentric activity recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1069–1078, 2021. 4

  60. [60]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 5

  61. [61]

    Gaze-guided graph neural network for action anticipation conditioned on intention

    Süleyman Özdel, Yao Rong, Berat Mert Albaba, Yen-Ling Kuo, Xi Wang, and Enkelejda Kasneci. Gaze-guided graph neural network for action anticipation conditioned on intention. In Proceedings of the 2024 Symposium on Eye Tracking Research and Applications, pages 1–9, 2024. 2

  62. [62]

    Advancements in context recognition for edge devices and smart eyewear: Sensors and applications

    Francesca Palermo, Luca Casciano, Lokmane Demagh, Aurelio Teliti, Niccolò Antonello, Giacomo Gervasoni, Hazem Hesham Yousef Shalby, Marco Brando Paracchini, Simone Mentasti, Hao Quan, et al. Advancements in context recognition for edge devices and smart eyewear: Sensors and applications. IEEE Access, 2025. 5, 8

  63. [63]

    Basket: A large-scale video dataset for fine-grained skill estimation

    Yulu Pan, Ce Zhang, and Gedas Bertasius. Basket: A large-scale video dataset for fine-grained skill estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1, 3

  64. [64]

    What to say and when to say it: Live fitness coaching as a testbed for situated interaction

    Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Böhm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, et al. What to say and when to say it: Live fitness coaching as a testbed for situated interaction. Advances in Neural Information Processing Systems, 37:75853–75882, 2024. 1

  65. [65]

    What and how well you performed? a multitask learning approach to action quality assessment

    Paritosh Parmar and Brendan Tran Morris. What and how well you performed? a multitask learning approach to action quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 304–313, 2019. 1

  66. [66]

    Piano skills assessment

    Paritosh Parmar, Jaiden Reddy, and Brendan Morris. Piano skills assessment. In 2021 IEEE 23rd international workshop on multimedia signal processing (MMSP), pages 1–5. IEEE, 2021. 2

  67. [67]

    Piano skills assessment

    Paritosh Parmar, Jaiden Reddy, and Brendan Morris. Piano skills assessment. In 2021 IEEE 23rd international workshop on multimedia signal processing (MMSP), pages 1–5. IEEE, 2021. 3

  68. [68]

    Domain knowledge-informed self-supervised representations for workout form assessment

    Paritosh Parmar, Amol Gharat, and Helge Rhodin. Domain knowledge-informed self-supervised representations for workout form assessment. In European Conference on Computer Vision, pages 105–123. Springer, 2022. 2

  69. [69]

    Egotrigger: Toward audio-driven image capture for human memory enhancement in all-day energy-efficient smart glasses

    Akshay Paruchuri, Sinan Hersek, Lavisha Aggarwal, Qiao Yang, Xin Liu, Achin Kulshrestha, Andrea Colaco, Henry Fuchs, and Ishan Chatterjee. Egotrigger: Toward audio-driven image capture for human memory enhancement in all-day energy-efficient smart glasses. arXiv preprint arXiv:2508.01915, 2025. 2, 5, 6, 7

  70. [70]

    Review on eye-hand span in sight-reading of music

    Joris Perra, Bénédicte Poulin-Charronnat, Thierry Baccino, and Véronique Drai-Zerbib. Review on eye-hand span in sight-reading of music. Journal of eye movement research, 14(4):10–16910, 2021. 5

  71. [71]

    E2 (go) motion: Motion augmented event stream for egocentric action recognition

    Chiara Plizzari, Mirco Planamente, Gabriele Goletto, Marco Cannici, Emanuele Gusso, Matteo Matteucci, and Barbara Caputo. E2 (go) motion: Motion augmented event stream for egocentric action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19935–19947, 2022. 5, 6

  72. [72]

    Egovlpv2: Egocentric video-language pre-training with fusion in the backbone

    Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. arXiv preprint arXiv:2307.05463, 2023. 5

  73. [73]

    Fitnets: Hints for thin deep nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets, 2015. 5

  74. [74]

    Electrasight: Fully onboard eye tracking for smart glasses with hybrid eog (heog)

    Nicolas Schärer, Federico Villani, Aishwarya Melatur, Steven Peter, Tommaso Polonelli, and Michele Magno. Electrasight: Fully onboard eye tracking for smart glasses with hybrid eog (heog). IEEE Internet of Things Journal,

  75. [75]

    Multisensebadminton: Wearable sensor–based biomechanical dataset for evaluation of badminton performance

    Minwoo Seong, Gwangbin Kim, Dohyeon Yeo, Yumin Kang, Heesan Yang, Joseph DelPreto, Wojciech Matusik, Daniela Rus, and SeungJun Kim. Multisensebadminton: Wearable sensor–based biomechanical dataset for evaluation of badminton performance. Scientific Data, 11(1):343,

  76. [76]

    Privaceye: privacy-preserving head-mounted eye tracking using egocentric scene image and eye movement features

    Julian Steil, Marion Koelle, Wilko Heuten, Susanne Boll, and Andreas Bulling. Privaceye: privacy-preserving head-mounted eye tracking using egocentric scene image and eye movement features. In Proceedings of the 11th ACM symposium on eye tracking research & applications, pages 1–10, 2019. 2

  77. [77]

    Predicting behaviors of basketball players from first person videos

    Shan Su, Jung Pyo Hong, Jianbo Shi, and Hyun Soo Park. Predicting behaviors of basketball players from first person videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1501–1510, 2017. 2

  78. [78]

    Look-ahead fixations during visuomotor behavior: Evidence from assembling a camping tent

    Brian Sullivan, Casimir JH Ludwig, Dima Damen, Walterio Mayol-Cuevas, and Iain D Gilchrist. Look-ahead fixations during visuomotor behavior: Evidence from assembling a camping tent. Journal of vision, 21(3):13–13, 2021. 2

  79. [79]

    Smartapm framework for adaptive power management in wearable devices using deep reinforcement learning

    R Sunder, Umesh Kumar Lilhore, Anjani Kumar Rai, Ehab Ghith, Mehdi Tlija, Sarita Simaiya, and Afraz Hussain Majeed. Smartapm framework for adaptive power management in wearable devices using deep reinforcement learning. Scientific Reports, 15(1):6911, 2025. 2

  80. [80]

    Egodistill: Egocentric head motion distillation for efficient video understanding

    Shuhan Tan, Tushar Nagarajan, and Kristen Grauman. Egodistill: Egocentric head motion distillation for efficient video understanding. Advances in Neural Information Processing Systems, 36:33485–33498, 2023. 2, 5, 6, 7, 8

Showing first 80 references.