SkillSight: Efficient First-Person Skill Assessment with Gaze
Pith reviewed 2026-05-17 05:35 UTC · model grok-4.3
The pith
SkillSight distills a gaze-only student model from joint video-and-gaze training to assess skill level with high accuracy at far lower power cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Skill level is evident not only in how a person performs an activity but also in how they direct their attention; a two-stage teacher-student framework first learns the joint distribution of gaze and egocentric video, then distills a gaze-only student that achieves state-of-the-art accuracy on three real-world datasets while eliminating continuous video processing.
What carries the argument
A two-stage distillation pipeline: a teacher first models gaze and video jointly to predict skill, then transfers that knowledge to a gaze-only student used for low-power inference.
If this is right
- Skill assessment becomes feasible on always-on wearable devices without draining the battery.
- The same gaze signal can support real-time coaching feedback during practice sessions.
- Datasets that record only eye tracking become sufficient for training future skill models.
- Power savings scale with the duration of the activity, enabling longer monitoring sessions.
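The last point follows from energy being power multiplied by time. Here is a minimal sketch of that arithmetic, assuming a constant power draw. The 73x ratio is the figure quoted in the abstract; the 730 mW baseline and the function names are hypothetical and chosen only for illustration.

```python
POWER_RATIO = 73.0  # from the abstract: gaze-only student vs. competing methods


def energy_mwh(power_mw: float, hours: float) -> float:
    """Energy in milliwatt-hours for a constant power draw."""
    return power_mw * hours


def energy_saved_mwh(baseline_power_mw: float, hours: float,
                     ratio: float = POWER_RATIO) -> float:
    """Energy saved by a model drawing `ratio` times less power than the baseline."""
    student_power_mw = baseline_power_mw / ratio
    return energy_mwh(baseline_power_mw, hours) - energy_mwh(student_power_mw, hours)


if __name__ == "__main__":
    baseline_mw = 730.0  # hypothetical baseline, not a measurement from the paper
    for hours in (1.0, 2.0, 4.0):
        # The saving grows linearly with session length.
        print(hours, energy_saved_mwh(baseline_mw, hours))
```

Under these assumptions the absolute saving doubles whenever the session length doubles, which is the sense in which longer monitoring sessions benefit more.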
Where Pith is reading between the lines
- The distillation step may generalize to other egocentric tasks where attention cues matter more than raw pixel content.
- Combining the gaze-only model with occasional low-frame-rate video checks could further improve robustness without much added cost.
- If gaze data can be captured on commodity smart glasses, the method could support large-scale studies of skill acquisition in everyday settings.
Load-bearing premise
Gaze patterns by themselves remain informative enough to predict skill level once the student has been distilled from video-plus-gaze training data.
What would settle it
A large accuracy drop when the gaze-only student is tested on a held-out activity set where gaze statistics no longer correlate with expert versus novice performance.
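The check described here amounts to comparing accuracy on familiar activities with accuracy on a held-out activity set. What follows is a minimal sketch of that comparison, not the paper's evaluation protocol. The function names and the 0.10 drop threshold are hypothetical; what counts as a "large" drop is left open by the text.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of skill labels predicted correctly."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def accuracy_drop(in_dist_acc: float, held_out_acc: float) -> float:
    """Positive when the model does worse on the held-out activities."""
    return in_dist_acc - held_out_acc


def premise_challenged(in_dist_acc: float, held_out_acc: float,
                       max_drop: float = 0.10) -> bool:
    """True if the drop exceeds the (hypothetical) tolerance `max_drop`."""
    return accuracy_drop(in_dist_acc, held_out_acc) > max_drop
```

A call such as `premise_challenged(accuracy(pred_seen, y_seen), accuracy(pred_held_out, y_held_out))` would then flag the case the page describes, with the threshold set by whoever runs the test.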
Original abstract
Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SkillSight, a two-stage framework for power-efficient first-person skill assessment. A teacher model jointly processes gaze and egocentric video to predict skill level across cooking, music, and sports tasks; this is distilled into a gaze-only student model that operates at inference without video input. The work claims SOTA performance for the teacher and that the student maintains high accuracy while using 73x less power than competing methods, establishing the value of gaze for skill understanding.
Significance. If the central results hold, the approach could enable practical on-device skill assessment for smart glasses by replacing continuous video processing with low-power gaze input. The cross-domain evaluation and distillation strategy provide a concrete path to power reduction while preserving accuracy, with potential impact on egocentric perception systems for real-world skill learning.
major comments (2)
- [§4] §4 (Experiments): The claim that the gaze-only student maintains high accuracy after distillation is load-bearing for the 73x power reduction result, yet the section provides no ablation isolating the contribution of gaze versus video features in the teacher or measuring performance drop when video context is removed at inference; without this, it is unclear whether skill cues transfer fully to the student.
- [§3.2] §3.2 (Distillation): The distillation loss is described as combining task loss and feature matching, but no analysis shows that this objective forces recovery of video-dependent discrimination cues (e.g., object attention timing) from gaze sequences alone; if the teacher exploits visual content unavailable to the student, the accuracy premise fails.
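The objective the referee describes, a task loss plus a feature-matching term, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. Cross-entropy as the task loss is an assumption, since the page does not name it. The L1 feature distance mirrors the form of the loss quoted elsewhere on this page. The weight `lam`, the function names, and the use of plain Python lists for features are all hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-likelihood of `label` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def l1_feature_distance(student_feat: Sequence[float],
                        teacher_feat: Sequence[float]) -> float:
    """L1 norm of the difference between student and teacher features."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(s - t) for s, t in zip(student_feat, teacher_feat))


def student_loss(logits: Sequence[float], label: int,
                 student_feat: Sequence[float],
                 teacher_feat: Sequence[float],
                 lam: float = 1.0) -> float:
    """Task loss plus `lam` times the feature-matching term (lam is hypothetical)."""
    return cross_entropy(logits, label) + lam * l1_feature_distance(
        student_feat, teacher_feat)
```

The referee's concern maps onto the second term. Minimising the L1 distance pulls the student's features toward the teacher's, but it does not by itself show that cues available only in video can be recovered from gaze.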
minor comments (2)
- [Abstract] Abstract: Specific numerical values for accuracy, power measurements, and baseline comparisons are missing, weakening the ability to assess the SOTA and 73x claims at a glance.
- [Figure 2] Figure 2: The diagram of the teacher-student pipeline would benefit from explicit annotation of the distillation loss terms and temperature parameter to match the text description.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the changes we will make to the manuscript.
- Referee: [§4] §4 (Experiments): The claim that the gaze-only student maintains high accuracy after distillation is load-bearing for the 73x power reduction result, yet the section provides no ablation isolating the contribution of gaze versus video features in the teacher or measuring performance drop when video context is removed at inference; without this, it is unclear whether skill cues transfer fully to the student.
  Authors: We agree that the current experiments would be strengthened by explicit ablations isolating the contribution of gaze versus video features. In the revised manuscript we will add a new ablation subsection in §4 that (i) compares teacher performance with and without video input and (ii) reports the accuracy drop between the joint teacher and the gaze-only student across all three datasets. These results will directly quantify how much skill-relevant information transfers through distillation and will support the 73x power-reduction claim.
  revision: yes
- Referee: [§3.2] §3.2 (Distillation): The distillation loss is described as combining task loss and feature matching, but no analysis shows that this objective forces recovery of video-dependent discrimination cues (e.g., object attention timing) from gaze sequences alone; if the teacher exploits visual content unavailable to the student, the accuracy premise fails.
  Authors: We acknowledge the value of additional analysis showing that the distillation objective recovers video-dependent cues from gaze alone. While the cross-domain results already indicate successful transfer, we will expand §3.2 with a brief discussion of the feature-matching term and add qualitative examples (gaze attention maps aligned with skill-critical events) plus a quantitative cue-recovery metric in the experiments. These additions will clarify how gaze sequences encode the necessary timing and focus information.
  revision: partial
Circularity Check
No circularity: standard teacher-student distillation with empirical validation on external datasets
The paper presents a two-stage pipeline: a teacher model jointly processes gaze and egocentric video to predict skill level, followed by distillation to a gaze-only student. This follows conventional knowledge distillation without reducing predictions to fitted parameters by construction or relying on self-citation chains for core claims. Experiments on three independent datasets (cooking, music, sports) provide external validation. No self-definitional equations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are described. The 73x power reduction is a measured outcome of removing video input at inference, not a definitional tautology. The derivation chain remains self-contained against benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- distillation loss weights and temperature
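The page does not say how the temperature enters SkillSight's loss. As a point of reference only, here is a sketch of the conventional role of temperature in knowledge distillation, where it softens a softmax over logits. The function name and the example logits are hypothetical, and nothing here should be read as the paper's formulation.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float],
                             temperature: float = 1.0) -> List[float]:
    """Softmax over logits / temperature; larger temperatures flatten the output."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # hypothetical teacher logits for three skill levels
    print(softmax_with_temperature(logits, temperature=1.0))
    print(softmax_with_temperature(logits, temperature=4.0))
```

Because both the loss weights and the temperature are tuned rather than derived, the ledger counts them as free parameters.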
axioms (1)
- domain assumption: Gaze direction and fixation patterns are informative of skill level in physical tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model... $L_{\mathrm{dis}} = \lVert f_p(\hat{e}_s) - f_t([e_v, e_c, e_g]) \rVert_1$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.