Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

Carlos R. del-Blanco; Enmin Zhong; Fernando Jaureguizar; Narciso García

arXiv: 2606.02962 · v1 · pith:TVETRKS2new · submitted 2026-06-01 · cs.CV · cs.AI · cs.HC · eess.IV



keywords: hand, query, grounding, queries, appearance, ego4d, egocentric, features


Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, even though roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
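
The abstract reports gains in R1@IoU=0.3 without defining the metric. The sketch below shows how this kind of score is commonly computed for temporal grounding: a query counts as a hit when the top-ranked predicted interval overlaps the ground-truth interval with a temporal IoU of at least the threshold. It is not taken from the paper. The function names, the (start, end) interval representation in seconds, and the choice to return a percentage are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: an interval is (start_sec, end_sec).
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two 1D temporal intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Interval],
                gts: List[Interval],
                iou_threshold: float = 0.3) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes one ground-truth interval and one top-ranked prediction per query.
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per query")
    if not gts:
        return 0.0
    hits = sum(
        temporal_iou(p, g) >= iou_threshold
        for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Two toy queries: the first prediction overlaps its target, the second does not.
    preds = [(0.0, 10.0), (0.0, 1.0)]
    targets = [(5.0, 15.0), (10.0, 20.0)]
    print(recall_at_1(preds, targets, iou_threshold=0.3))
```

Under this reading, a reported change such as +2.54 R1@IoU=0.3 would be a difference between two such scores computed over the same set of queries.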
