SkillSight: Efficient First-Person Skill Assessment with Gaze
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
fields
cs.CV (1)
years
2026 (1)
verdicts
UNVERDICTED (1)
representative citing papers
SkillSpotter: Pose-Aware Multi-View Skilled Action Detection and Grading in Ego-Exo Videos
SkillSpotter raises class-specific mAP from 12.40 to 21.82 and balanced accuracy to 60.40% on Ego-Exo4D by adding adaptive temporal suppression, gated pose fusion, and bidirectional cross-view attention to temporal action detectors.