Detecting Precise Hand Touch Moments in Egocentric Video

Feras Dayoub; Huy Anh Nguyen; Minh Hoai

arxiv: 2604.12343 · v1 · submitted 2026-04-14 · 💻 cs.CV


Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3


keywords: egocentric video, hand-object contact detection, event spotting, cross-attention, grasp-aware loss, TouchMoment dataset, fine-grained manipulation, first-person vision


The pith

A hand-informed context module detects exact contact frames in egocentric video by combining hand-region features with surrounding context via cross-attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the problem of spotting the single video frame at which a hand first touches an object, as seen from the wearer's own camera. Precise timing matters because many downstream systems in augmented reality, robot learning from human demonstrations, and assistive interfaces trigger actions or record events exactly at contact onset. The proposed solution is a module that extracts spatiotemporal features from detected hand regions and their local surroundings, then uses cross-attention to highlight patterns that signal actual touch rather than near-miss motion. A grasp-aware loss further trains the model to emphasize hand-pose dynamics that distinguish contact from non-contact frames. The authors also release a new dataset of 4,021 egocentric videos containing 8,456 annotated contact moments, allowing direct measurement of performance under a strict two-frame tolerance window.

Core claim

The Hand-informed Context Enhanced module learns to identify contact moments by fusing hand-region and contextual features through cross-attention and by applying a grasp-aware loss that penalizes confusion between near-contact and true-contact frames, producing a 16.91 percent gain in average precision over prior event-spotting methods when predictions must fall within two frames of ground truth on the TouchMoment dataset.

What carries the argument

The Hand-informed Context Enhanced (HiCE) module, which applies cross-attention between spatiotemporal features extracted from hand regions and their surrounding context, augmented by a grasp-aware loss and soft labels that emphasize pose and motion patterns at the instant of contact.
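
To make the fusion step concrete, here is a minimal sketch of cross-attention in which global frame tokens act as queries and hand-patch tokens act as keys and values, the role assignment given in the Figure 2 caption. It is illustrative only and is not the authors' implementation: it uses a single head and plain Python lists, and it omits the learned projections and the positional and identity embeddings. The names `cross_attention`, `global_tokens`, and `hand_tokens` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dim vectors (assumed: global frame tokens)
    keys, values: lists of vectors (assumed: hand-patch tokens)
    Returns one fused vector per query and the attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every hand token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data: two global tokens attend over two hand tokens (left, right).
global_tokens = [[1.0, 0.0], [0.0, 1.0]]
hand_tokens = [[2.0, 0.0], [0.0, 2.0]]
fused, attn = cross_attention(global_tokens, hand_tokens, hand_tokens)
```

Each row of `attn` sums to 1, so every global token receives a convex combination of the hand-token values. In this toy case the first global token puts most of its weight on the first hand token because their dot product is largest.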

If this is right

Augmented-reality systems can register object interactions at the exact frame of contact rather than after a noticeable delay.
Robot learning pipelines can segment human demonstrations into pre-contact approach and post-contact manipulation phases with frame-level accuracy.
Assistive devices for users with motor impairments gain the ability to confirm touch events without requiring additional sensors.
Fine-grained action recognition in first-person video improves when contact onset serves as a reliable temporal anchor for subsequent motion analysis.
The released TouchMoment dataset supplies a common benchmark for comparing future contact-detection algorithms under controlled tolerance windows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hand-centric attention pattern could be tested on non-egocentric footage to determine whether the performance edge depends on the first-person viewpoint or generalizes to third-person observation.
Combining the module with explicit 3D hand-pose tracking might reduce errors in heavily occluded scenes where 2D appearance alone is ambiguous.
Extending the grasp-aware loss to other object-interaction events, such as tool grasp or release, would test whether the core mechanism applies beyond simple touch.
Real-time deployment on wearable devices would require measuring whether the cross-attention computation fits within typical mobile GPU budgets while preserving the two-frame accuracy.

Load-bearing premise

Hand regions together with nearby visual context contain reliable signals that cross-attention can use to separate actual contact from near-contact motion even under occlusion and fine manipulation.

What would settle it

Removing the cross-attention pathway between hand features and context or ablating the grasp-aware loss on the TouchMoment dataset and finding that average precision under two-frame tolerance falls to the level of standard event-spotting baselines would falsify the claim that these components drive the reported gains.

Figures

Figures reproduced from arXiv: 2604.12343 by Feras Dayoub, Huy Anh Nguyen, Minh Hoai.

**Figure 2.** The Hand-informed Context Enhanced (HiCE) module augments the feature extractor with a parallel hand-patch branch that processes left and right hand crops alongside global frame features using RegNet-Y backbones. Hand patches are expanded and encoded with positional and identity embeddings, then used as keys and values in a multi-head cross-attention block where global tokens act as queries. The resulting …

**Figure 3.** Qualitative examples of T-DEED with HiCE on HOI4D (a) and TACO (b). For each example, the top plot shows raw prediction score and the bottom plot shows predictions after NMS/SNMS. Ground-truth touch frames are indicated by red dashed lines, with a tolerance window of δ = 2 frames. … touch frames are tightly clustered and the narrow displacement window already enforces sharp supervision. Temporal offset refi…
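
The bottom plots in Figure 3 show scores after non-maximum suppression. As a rough illustration of what that post-processing does to a per-frame score curve, the sketch below implements greedy one-dimensional NMS, assuming suppression over a fixed frame window. The function name `temporal_nms`, the window size, and the toy scores are hypothetical and are not taken from the paper. A Soft-NMS variant would decay neighbouring scores instead of discarding them.

```python
def temporal_nms(scores, window, threshold=0.0):
    """Greedy 1-D NMS over per-frame scores.

    Repeatedly keeps the highest-scoring unsuppressed frame and
    suppresses every frame within `window` frames of it.
    Returns a list of (frame_index, score) sorted by frame index.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    suppressed = [False] * len(scores)
    kept = []
    for i in order:
        if suppressed[i] or scores[i] <= threshold:
            continue
        kept.append((i, scores[i]))
        lo = max(0, i - window)
        hi = min(len(scores), i + window + 1)
        for j in range(lo, hi):
            suppressed[j] = True
    return sorted(kept)


# Toy score curve with two clusters of high scores.
raw_scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.3, 0.7, 0.2]
peaks = temporal_nms(raw_scores, window=2)
print(peaks)  # → [(2, 0.9), (7, 0.7)]
```

The choice of window matters under a two-frame tolerance: too small a window leaves duplicate detections near one contact, while too large a window can merge two contacts that occur close together.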

Original abstract

We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a new dataset for exact hand contact timing in egocentric video and reports a 17-point AP lift with a hand-focused attention module, but the gain hinges on unverified baseline re-training details.


This paper's core contribution is the TouchMoment dataset of 4,021 egocentric videos with 8,456 frame-precise contact annotations, paired with a HiCE module that runs cross-attention over hand regions and surrounding context plus a grasp-aware loss with soft labels. The authors claim this beats standard event-spotting baselines by 16.91% average precision when predictions must land within two frames of ground truth. That strict tolerance matches real needs in AR and robot learning where contact onset matters for action triggers.

The dataset itself looks like the most reusable part; anyone building temporal models for manipulation will appreciate having 1M+ frames with explicit contact labels instead of coarse action segments. The HiCE design is a reasonable specialization of attention for this cue, and the grasp-aware term tries to push the model toward pose and motion patterns that distinguish near-contact from actual touch. Those choices are sensible given the challenges listed in the abstract.

The main soft spot is the experimental comparison. The abstract states the gain but does not confirm that every baseline was retrained from scratch on the identical splits, with the same data augmentations, optimizer schedule, and exact two-frame tolerance metric. If the baselines kept their original hyperparameters or if contact labels were resolved differently for occluded frames, the delta could be partly an artifact. Annotation reliability for sub-second contact moments is also load-bearing; small shifts in how annotators marked the onset could move the numbers. No external validation set or cross-dataset test is mentioned, so the result stays tied to this one collection.

Readers working on egocentric vision, fine-grained event detection, or HCI datasets will find the data and task definition useful even if they adapt their own models. The work is coherent on its own terms and engages the right prior literature on event spotting. It deserves a serious referee who can check the training protocol and ask for ablations on the loss components. I would send it to review rather than desk reject.

Referee Report

2 major / 3 minor

Summary. The paper addresses precise frame-level detection of hand-object contact moments in egocentric video, a task relevant to AR, HCI, and robotics. It introduces the HiCE module, which applies cross-attention between hand-region features and surrounding spatiotemporal context to capture contact patterns, and augments training with a grasp-aware loss using soft labels that emphasize hand-pose dynamics near contact. The authors release the TouchMoment dataset (4,021 videos, 8,456 annotated contacts over >1M frames) and report that their method outperforms state-of-the-art event-spotting baselines by 16.91% average precision under a strict two-frame tolerance evaluation criterion.

Significance. If the reported gains prove robust, the work would advance fine-grained temporal localization in first-person vision and provide a useful benchmark dataset. The hand-informed cross-attention design is a plausible inductive bias for contact detection, and the grasp-aware loss targets a known difficulty (distinguishing near-contact from contact frames). Dataset release is a clear positive contribution.

major comments (2)

Abstract and §4 (Experiments): The 16.91% AP gain under two-frame tolerance is the central empirical claim. It is unclear whether the event-spotting baselines were re-implemented and trained from scratch on the identical TouchMoment train/val/test splits (4,021 videos) using the same data augmentations, optimizer, and label-resolution rules for occluded or ambiguous contacts. Any mismatch would undermine the delta; please report training protocols and hyper-parameters for all baselines side-by-side.
§3.2 (Grasp-aware loss): The soft-label formulation is load-bearing. If the grasp-aware term simply re-weights frames within a fixed window around annotated contacts, the improvement could be an artifact of label smoothing rather than learned contact-specific patterns. An ablation that isolates the soft-label component against standard binary cross-entropy (with identical HiCE backbone) is required to substantiate the claim.
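
The contrast behind the second comment can be sketched in a few lines: hard binary targets versus targets smoothed over a window around each annotated contact, both scored with binary cross-entropy. This is an assumption-laden illustration of the generic smoothing the comment worries about, not the paper's grasp-aware formulation, which is not described here. The Gaussian shape, `sigma`, and all function names are hypothetical.

```python
import math


def hard_labels(num_frames, contact_frames):
    """Binary targets: 1.0 at annotated contact frames, 0.0 elsewhere."""
    return [1.0 if t in contact_frames else 0.0 for t in range(num_frames)]


def gaussian_soft_labels(num_frames, contact_frames, sigma=1.0):
    """Generic soft targets (assumed form): a Gaussian bump centred on
    each contact frame; the largest bump wins where bumps overlap."""
    labels = []
    for t in range(num_frames):
        bump = max(
            (math.exp(-((t - c) ** 2) / (2 * sigma ** 2)) for c in contact_frames),
            default=0.0,
        )
        labels.append(bump)
    return labels


def bce(preds, targets, eps=1e-7):
    """Mean binary cross-entropy; also valid for soft targets in [0, 1]."""
    total = 0.0
    for p, y in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(preds)


contacts = {2}
hard = hard_labels(5, contacts)
soft = gaussian_soft_labels(5, contacts, sigma=1.0)
preds = [0.05, 0.30, 0.80, 0.30, 0.05]
loss_hard = bce(preds, hard)
loss_soft = bce(preds, soft)
```

An ablation of the kind requested would train the same backbone once against `hard`-style targets and once against the paper's soft labels, then compare average precision under the same tolerance. A generic bump like the one above would serve as a third arm to separate plain smoothing from contact-specific structure.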

minor comments (3)

§2 (Related Work): The positioning against prior egocentric action and contact datasets (e.g., EPIC-KITCHENS, Something-Something) would benefit from a brief table comparing annotation granularity and temporal precision.
§4.3 (Evaluation metric): Explicitly define how the two-frame tolerance is implemented (e.g., whether a prediction is credited if any frame in [GT-2, GT+2] is selected, or only the closest frame).
Figure 2 and §3.1: The cross-attention diagram would be clearer if the query/key/value dimensions and the hand-region masking strategy were annotated directly on the figure.
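
The ambiguity raised about §4.3 is easiest to see in code. The sketch below shows one possible reading of the two-frame tolerance: predictions are ranked by score, each ground-truth moment can be matched at most once, and a prediction counts as correct if an unmatched ground-truth frame lies within `tolerance` frames. This matching rule and the non-interpolated AP formula are assumptions, not the paper's stated protocol. The function name and toy data are hypothetical, and ground-truth frames are assumed unique within a video.

```python
def average_precision(predictions, ground_truth, tolerance=2):
    """Tolerance-based AP for event spotting in a single video.

    predictions: list of (frame, score) pairs
    ground_truth: list of unique annotated contact frames
    A prediction is a true positive if an unmatched ground-truth frame
    lies within `tolerance` frames; each ground truth matches once.
    """
    if not ground_truth:
        return 0.0
    matched = set()
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    for rank, (frame, _score) in enumerate(ranked, start=1):
        candidates = [
            g for g in ground_truth
            if g not in matched and abs(g - frame) <= tolerance
        ]
        if candidates:
            nearest = min(candidates, key=lambda g: abs(g - frame))
            matched.add(nearest)
            true_positives += 1
            # Accumulate precision at each recall step.
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


gt_frames = [10, 50]
preds = [(11, 0.9), (30, 0.8), (53, 0.7), (49, 0.6)]
ap_strict = average_precision(preds, gt_frames, tolerance=2)
ap_loose = average_precision(preds, gt_frames, tolerance=3)
print(ap_strict)  # → 0.75
```

Widening the tolerance from two to three frames turns the prediction at frame 53 into a match and changes the score, which is why the exact crediting rule needs to be stated alongside any reported AP.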

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to clarify our experimental setup and strengthen the empirical claims. We address each major comment below.


Referee: Abstract and §4 (Experiments): The 16.91% AP gain under two-frame tolerance is the central empirical claim. It is unclear whether the event-spotting baselines were re-implemented and trained from scratch on the identical TouchMoment train/val/test splits (4,021 videos) using the same data augmentations, optimizer, and label-resolution rules for occluded or ambiguous contacts. Any mismatch would undermine the delta; please report training protocols and hyper-parameters for all baselines side-by-side.

Authors: All baselines were re-implemented from scratch and trained on the identical TouchMoment train/val/test splits using the same data augmentations, optimizer, and label-resolution rules for occluded or ambiguous contacts. We will add a side-by-side table of training protocols and hyper-parameters in the revised §4 to make this fully transparent. revision: yes

Referee: §3.2 (Grasp-aware loss): The soft-label formulation is load-bearing. If the grasp-aware term simply re-weights frames within a fixed window around annotated contacts, the improvement could be an artifact of label smoothing rather than learned contact-specific patterns. An ablation that isolates the soft-label component against standard binary cross-entropy (with identical HiCE backbone) is required to substantiate the claim.

Authors: We agree that isolating the soft-label contribution is necessary. We performed an additional ablation using the identical HiCE backbone that compares the grasp-aware loss with soft labels against standard binary cross-entropy. The soft-label version yields a clear additional gain, indicating the benefit arises from modeling hand-pose dynamics rather than generic smoothing. We will insert this ablation into the revised §3.2 and experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on new dataset with standard supervised training


The paper introduces a new dataset (TouchMoment) and a neural architecture (HiCE with cross-attention and grasp-aware loss) whose performance is measured by average precision on held-out video splits. No mathematical derivation chain is claimed; the central result is an empirical delta (16.91% AP) obtained by training and evaluating models on the same annotated data. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The method description is a standard supervised-learning pipeline whose correctness is externally falsifiable via the released dataset and re-implementation of baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of the newly proposed HiCE cross-attention module and grasp-aware loss, which are introduced without independent prior evidence beyond the reported experiments.

free parameters (1)

neural network hyperparameters and loss weights
Standard deep learning training parameters fitted to the TouchMoment data.

invented entities (2)

HiCE module (no independent evidence)
purpose: Leverage spatiotemporal hand context via cross-attention for contact detection
New architecture component proposed in the paper.

grasp-aware loss (no independent evidence)
purpose: Emphasize hand pose and movement dynamics for distinguishing contact frames
New loss formulation proposed in the paper.




[1] [1]

Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-nms — improving object detection with one line of code. InProceedings of the International Con- ference on Computer Vision, 2017. 6

work page 2017

[2] [2]

Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles

S. Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. SST: Single-stream temporal ac- tion proposals. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2

work page 2017

[3] [3]

Black, and Dim- itrios Tzionas

Yixin Chen, Sai Kumar Dwivedi, Michael J. Black, and Dim- itrios Tzionas. Detecting human-object contact in images. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. 3

work page 2023

[4] [4]

Tianyi Cheng, Dandan Shan, Ayda Sultan, Richard E. L. Higgins, and David F. Fouhey. Towards a richer 2d under- standing of hands at scale. InAdvances in Neural Informa- tion Processing Systems, 2023. 2, 3, 5

work page 2023

[5] [5]

Empirical evaluation of gated recurrent neu- ral networks on sequence modeling.ArXiv, 2014

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neu- ral networks on sequence modeling.ArXiv, 2014. 3

work page 2014

[6] [6]

EPIC-KITCHENS VISOR Benchmark: Video seg- mentations and object relations

Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. EPIC-KITCHENS VISOR Benchmark: Video seg- mentations and object relations. InAdvances in Neural In- formation Processing Systems, 2022. 3

work page 2022

[7] [7]

Seikavandi, Jacob V

Adrien Deli `ege, Anthony Cioppa, Silvio Giancola, Meisam J. Seikavandi, Jacob V . Dueholm, Kamal Nas- rollahi, Bernard Ghanem, Thomas B. Moeslund, and Marc Van Droogenbroeck. SoccerNet-v2: A dataset and bench- marks for holistic understanding of broadcast soccer videos. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Works...

work page 2021

[8] [8]

COMEDIAN: Self-supervised learning and knowledge distillation for action spotting using transformers

Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi, and Romain H ´erault. COMEDIAN: Self-supervised learning and knowledge distillation for action spotting using transformers. InProceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops, 2024. 2, 5, 6

work page 2024

[9] [9]

Slowfast networks for video recognition

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the International Conference on Computer Vision, 2019. 2

[10] Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. SoccerNet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018. 2

[11] Mohit Goyal, Sahil Modi, Rishabh Goyal, and Saurabh Gupta. Human hands as probes for interactive object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. 3

[12] Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, and Abhinav Shrivastava. GTA: Global Temporal Attention for Video Action Understanding. In Proceedings of the British Machine Vision Conference, 2020. 2

[13] James Hong, Haotian Zhang, Michaël Gharbi, Matthew Fisher, and Kayvon Fatahalian. Spotting temporally precise, fine-grained events in video. In Proceedings of the European Conference on Computer Vision, 2022. 2, 3, 4, 6

[14] Mingzhen Huang, Supreeth Narasimhaswamy, Saif Vazir, Haibin Ling, and Minh Hoai. Forward propagation, backward regression, and pose association for hand tracking in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 3

[15] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 3

[16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327, 2020. 6

[17] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of International Conference on Learning and Representation, 2019. 6

[18] Supreeth Narasimhaswamy, Zhengwei Wei, Yang Wang, Justin Zhang, and Minh Hoai. Contextual attention for hand detection in the wild. In Proceedings of the International Conference on Computer Vision, 2019. 3

[19] Supreeth Narasimhaswamy, Trung Nguyen, and Minh Hoai. Detecting hands and recognizing physical contact in the wild. In Advances in Neural Information Processing Systems, 2020. 3

[20] Supreeth Narasimhaswamy, Thanh Nguyen, Mingzhen Huang, and Minh Hoai. Whose hands are these? hand detection and hand-body association in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 3

[21] Supreeth Narasimhaswamy, Huy Anh Nguyen, Lihan Huang, and Minh Hoai. HOIST-Former: Hand-held objects identification segmentation and tracking in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024. 3

[22] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In Proceedings of the International Conference on Computer Vision Workshops,

[23] Aditya Prakash, Ruisen Tu, Matthew Chang, and Saurabh Gupta. 3d hand pose estimation in everyday egocentric images. In Proceedings of the European Conference on Computer Vision, 2024. 5

[24] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 6

[25] Dandan Shan, Jiaqi Geng, Michelle Shu, and David F. Fouhey. Understanding Human Hands in Contact at Internet Scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 3

[26] Dingfeng Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, and Dacheng Tao. TriDet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023. 2, 3, 5

[27] João V. B. Soares and Avijit Shah. Action spotting using dense detection anchors revisited: Submission to the soccernet challenge 2022. ArXiv, 2022. 2, 5, 6

[28] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. Gate-Shift Networks for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 3

[29] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. Gate-shift-fuse for video action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10913–10928, 2023. 3

[30] Du Tran, Heng Wang, Matt Feiszli, and Lorenzo Torresani. Video Classification With Channel-Separated Convolutional Networks. In Proceedings of the International Conference on Computer Vision, 2019. 2

[31] Kim Hoang Tran, Phuc Vuong Do, Ngoc Quoc Ly, and Ngan Le. Unifying Global and Local Scene Entities Modelling for Precise Action Spotting. In Proceedings of the International Joint Conference on Neural Networks, 2024. 3, 6

[32] Khoa Vo, Sang Truong, Kashu Yamazaki, Bhiksha Raj, Minh-Triet Tran, and Ngan Le. Aoe-net: Entities interactions modeling with adaptive attention mechanism for temporal action proposals generation. International Journal of Computer Vision, 131(1):302–323, 2022. 3

[33] Ross Wightman, Hugo Touvron, and Hervé Jegou. Resnet strikes back: An improved training procedure in timm. ArXiv, 2021. 6

[34] Artur Xarles, Sergio Escalera, Thomas B. Moeslund, and Albert Clapés. Astra: An action spotting transformer for soccer videos. In Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, 2023. 2, 3, 5, 6

[35] Artur Xarles, Sergio Escalera, Thomas B. Moeslund, and Albert Clapés. T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2024. 3, 4, 5, 6

[36] Hao Xu, Arbind Agrahari Baniya, Sam Well, Mohamed Reda Bouadjenek, Richard Dazeley, and Sunil Aryal. Deep learning for sports video event detection: Tasks, datasets, methods, and challenges. ArXiv, 2025. 2

[37] Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, and Bernard Ghanem. G-TAD: Sub-Graph Localization for Temporal Action Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 2

[38] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei Zhou. Temporal Pyramid Network for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020. 2

[39] Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation. In Proceedings of the British Machine Vision Conference, 2021. 2

[40] Chen-Lin Zhang, Jianxin Wu, and Yin Li. Actionformer: Localizing moments of actions with transformers. In Proceedings of the European Conference on Computer Vision,

[41] Hongyi Zhang, Moustapha Cissé, Yann Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of International Conference on Learning and Representation, 2018. 6

[42] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. GLIPv2: Unifying localization and vision-language understanding. In Advances in Neural Information Processing Systems, 2022. 3

[43] Xin Zhou, Le Kang, Zhiyu Cheng, Bo He, and Jingyu Xin. Feature combination meets attention: Baidu soccer embeddings and transformer based temporal detection. ArXiv,
