arxiv: 2604.12343 · v1 · submitted 2026-04-14 · 💻 cs.CV

Detecting Precise Hand Touch Moments in Egocentric Video

Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric video · hand-object contact detection · event spotting · cross-attention · grasp-aware loss · TouchMoment dataset · fine-grained manipulation · first-person vision

The pith

A hand-informed context module detects exact contact frames in egocentric video by combining hand-region features with surrounding context via cross-attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the problem of spotting the single video frame when a hand first touches an object, viewed from the wearer's own camera. Precise timing matters because many downstream systems in augmented reality, robot learning from human demonstrations, and assistive interfaces trigger actions or record events exactly at contact onset. The proposed solution builds a module that extracts spatiotemporal features from detected hand areas and their local surroundings, then uses cross-attention to highlight patterns that signal actual touch rather than near-miss motion. A grasp-aware loss further trains the model to emphasize hand-pose dynamics that distinguish contact from non-contact frames. The authors also release a new dataset, TouchMoment, of 4,021 egocentric videos containing 8,456 annotated contact moments across more than one million frames, allowing direct measurement of performance under a strict two-frame tolerance window.

Core claim

The Hand-informed Context Enhanced module learns to identify contact moments by fusing hand-region and contextual features through cross-attention and by applying a grasp-aware loss that penalizes confusion between near-contact and true-contact frames, producing a 16.91 percent gain in average precision over prior event-spotting methods when predictions must fall within two frames of ground truth on the TouchMoment dataset.
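To make the "within two frames of ground truth" criterion concrete, here is a minimal sketch of average precision under a frame tolerance. It is an illustration under stated assumptions, not the paper's evaluation code: the function name `average_precision_at_tolerance`, the greedy one-to-one matching rule, and the non-interpolated AP formula are all choices made for this example, and the paper's exact protocol may differ.

```python
def average_precision_at_tolerance(predictions, ground_truth, tolerance=2):
    """AP for event spotting under a frame tolerance (illustrative sketch).

    predictions: list of (frame_index, confidence) pairs.
    ground_truth: list of annotated contact frame indices.
    Assumption: each ground-truth moment can be matched by at most one
    prediction, and predictions are processed in descending confidence.
    """
    matched = set()  # indices of ground-truth moments already claimed
    hits = []        # True/False per prediction, in confidence order
    for frame, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        candidates = [
            (abs(gt - frame), idx)
            for idx, gt in enumerate(ground_truth)
            if idx not in matched and abs(gt - frame) <= tolerance
        ]
        if candidates:
            matched.add(min(candidates)[1])  # closest unmatched moment
            hits.append(True)
        else:
            hits.append(False)

    if not ground_truth:
        return 0.0

    # Non-interpolated AP: mean of precision values at each true positive,
    # normalised by the number of ground-truth moments.
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


preds = [(10, 0.9), (50, 0.8), (13, 0.7), (31, 0.6)]
moments = [11, 30]
print(average_precision_at_tolerance(preds, moments, tolerance=2))
```

With a tolerance of zero frames none of these example predictions would count, which shows how strongly the tolerance setting shapes the reported number.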

What carries the argument

The Hand-informed Context Enhanced (HiCE) module, which applies cross-attention between spatiotemporal features extracted from hand regions and their surrounding context, augmented by a grasp-aware loss and soft labels that emphasize pose and motion patterns at the instant of contact.
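The cross-attention step can be sketched in a few lines. The example below is a hedged, single-head illustration in plain Python: following the Figure 2 caption, global frame tokens act as queries and hand-patch tokens act as keys and values. The function names, the absence of learned projection matrices, and the toy vectors are assumptions made for this sketch; the actual HiCE module uses multi-head attention with positional and identity embeddings.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative).

    queries: global frame tokens, each a list of d floats.
    keys, values: hand-patch tokens (keys have d floats each).
    Returns one attended vector per query token.
    """
    dim = len(keys[0])
    outputs = []
    for query in queries:
        # Similarity of this global token to every hand-patch token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted sum of hand-patch values.
        attended = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(attended)
    return outputs


# Toy example: one global token, two hand-patch tokens (say left and right).
global_tokens = [[1.0, 0.0]]
hand_keys = [[10.0, 0.0], [0.0, 10.0]]
hand_values = [[1.0, 0.0], [0.0, 1.0]]
print(cross_attention(global_tokens, hand_keys, hand_values))
```

In this toy case the global token aligns with the first hand token, so almost all of the attention weight, and therefore the output, comes from that token's value vector.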

If this is right

  • Augmented-reality systems can register object interactions at the exact frame of contact rather than after a noticeable delay.
  • Robot learning pipelines can segment human demonstrations into pre-contact approach and post-contact manipulation phases with frame-level accuracy.
  • Assistive devices for users with motor impairments gain the ability to confirm touch events without requiring additional sensors.
  • Fine-grained action recognition in first-person video improves when contact onset serves as a reliable temporal anchor for subsequent motion analysis.
  • The released TouchMoment dataset supplies a common benchmark for comparing future contact-detection algorithms under controlled tolerance windows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hand-centric attention pattern could be tested on non-egocentric footage to determine whether the performance edge depends on the first-person viewpoint or generalizes to third-person observation.
  • Combining the module with explicit 3D hand-pose tracking might reduce errors in heavily occluded scenes where 2D appearance alone is ambiguous.
  • Extending the grasp-aware loss to other object-interaction events, such as tool grasp or release, would test whether the core mechanism applies beyond simple touch.
  • Real-time deployment on wearable devices would require measuring whether the cross-attention computation fits within typical mobile GPU budgets while preserving the two-frame accuracy.

Load-bearing premise

Hand regions together with nearby visual context contain reliable signals that cross-attention can use to separate actual contact from near-contact motion even under occlusion and fine manipulation.

What would settle it

Removing the cross-attention pathway between hand features and context or ablating the grasp-aware loss on the TouchMoment dataset and finding that average precision under two-frame tolerance falls to the level of standard event-spotting baselines would falsify the claim that these components drive the reported gains.

Figures

Figures reproduced from arXiv: 2604.12343 by Feras Dayoub, Huy Anh Nguyen, Minh Hoai.

Figure 1: We develop a model to detect the precise touch mo…
Figure 2: The Hand-informed Context Enhanced (HiCE) module augments the feature extractor with a parallel hand-patch branch that processes left and right hand crops alongside global frame features using RegNet-Y backbones. Hand patches are expanded and encoded with positional and identity embeddings, then used as keys and values in a multi-head cross-attention block where global tokens act as queries. The resulting …
Figure 3: Qualitative examples of T-DEED with HiCE on HOI4D (a) and TACO (b). For each example, the top plot shows raw prediction score and the bottom plot shows predictions after NMS/SNMS. Ground-truth touch frames are indicated by red dashed lines, with a tolerance window of δ = 2 frames.
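The Figure 3 caption mentions post-processing raw per-frame scores with NMS/SNMS. As a rough illustration of what that step does on a one-dimensional score sequence, the sketch below implements a hard temporal NMS and a Gaussian-decay soft variant. It assumes that SNMS refers to Soft-NMS adapted to the time axis; the function names, window size, `sigma`, and threshold are hypothetical choices for this example, not values taken from the paper.

```python
import math


def temporal_nms(scores, window=2, threshold=0.5):
    """Hard NMS: keep a peak, then suppress every frame within +/- window."""
    suppressed = [False] * len(scores)
    kept = []
    for i in sorted(range(len(scores)), key=lambda k: -scores[k]):
        if suppressed[i] or scores[i] < threshold:
            continue
        kept.append(i)
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        for j in range(lo, hi):
            suppressed[j] = True
    return sorted(kept)


def temporal_soft_nms(scores, sigma=2.0, threshold=0.5):
    """Soft variant: decay neighbouring scores instead of zeroing them.

    Frames close to a kept peak are multiplied by a factor near 0,
    distant frames by a factor near 1 (Gaussian in temporal distance).
    """
    remaining = dict(enumerate(scores))
    kept = []
    while remaining:
        i = max(remaining, key=remaining.get)
        score = remaining.pop(i)
        if score < threshold:
            break
        kept.append((i, score))
        for j in remaining:
            distance = j - i
            remaining[j] *= 1.0 - math.exp(-(distance ** 2) / (2 * sigma ** 2))
    return sorted(kept)


raw = [0.1, 0.6, 0.9, 0.7, 0.1, 0.1, 0.1, 0.8, 0.2]
print(temporal_nms(raw))
print([frame for frame, _ in temporal_soft_nms(raw)])
```

Both variants collapse the cluster of high scores around frame 2 into a single detection while leaving the separate peak at frame 7 in place, which is the behaviour the bottom plots in the figure are meant to show.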
Original abstract

We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper addresses precise frame-level detection of hand-object contact moments in egocentric video, a task relevant to AR, HCI, and robotics. It introduces the HiCE module, which applies cross-attention between hand-region features and surrounding spatiotemporal context to capture contact patterns, and augments training with a grasp-aware loss using soft labels that emphasize hand-pose dynamics near contact. The authors release the TouchMoment dataset (4,021 videos, 8,456 annotated contacts over >1M frames) and report that their method outperforms state-of-the-art event-spotting baselines by 16.91% average precision under a strict two-frame tolerance evaluation criterion.

Significance. If the reported gains prove robust, the work would advance fine-grained temporal localization in first-person vision and provide a useful benchmark dataset. The hand-informed cross-attention design is a plausible inductive bias for contact detection, and the grasp-aware loss targets a known difficulty (distinguishing near-contact from contact frames). Dataset release is a clear positive contribution.

major comments (2)
  1. Abstract and §4 (Experiments): The 16.91% AP gain under two-frame tolerance is the central empirical claim. It is unclear whether the event-spotting baselines were re-implemented and trained from scratch on the identical TouchMoment train/val/test splits (4,021 videos) using the same data augmentations, optimizer, and label-resolution rules for occluded or ambiguous contacts. Any mismatch would undermine the delta; please report training protocols and hyper-parameters for all baselines side-by-side.
  2. §3.2 (Grasp-aware loss): The soft-label formulation is load-bearing. If the grasp-aware term simply re-weights frames within a fixed window around annotated contacts, the improvement could be an artifact of label smoothing rather than learned contact-specific patterns. An ablation that isolates the soft-label component against standard binary cross-entropy (with identical HiCE backbone) is required to substantiate the claim.
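The distinction raised in the second major comment can be made concrete with a small sketch of the two training targets being compared. This is a generic illustration, not the paper's grasp-aware loss, whose exact form is not given here: the Gaussian shape of the soft labels, the `sigma` value, and all function names are assumptions for this example. It shows only the "generic smoothing" baseline that an ablation would need to separate from any pose-dependent weighting.

```python
import math


def hard_labels(num_frames, contact_frames):
    """Binary targets: 1.0 exactly at annotated contact frames, else 0.0."""
    return [1.0 if t in contact_frames else 0.0 for t in range(num_frames)]


def gaussian_soft_labels(num_frames, contact_frames, sigma=1.0):
    """Soft targets that decay with distance from the nearest contact frame.

    Assumption: a plain Gaussian window, independent of hand pose.
    """
    return [
        max(
            (math.exp(-((t - c) ** 2) / (2 * sigma ** 2)) for c in contact_frames),
            default=0.0,
        )
        for t in range(num_frames)
    ]


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; accepts soft targets in [0, 1]."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


contacts = [3]
predicted = [0.01, 0.05, 0.40, 0.90, 0.30, 0.05, 0.01]
print(hard_labels(7, contacts))
print([round(v, 3) for v in gaussian_soft_labels(7, contacts)])
print(binary_cross_entropy(predicted, hard_labels(7, contacts)))
print(binary_cross_entropy(predicted, gaussian_soft_labels(7, contacts)))
```

Training the same backbone once with each target type, and once with the full grasp-aware term, is the kind of comparison that would show whether the gain comes from contact-specific cues or from smoothing alone.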
minor comments (3)
  1. §2 (Related Work): The positioning against prior egocentric action and contact datasets (e.g., EPIC-KITCHENS, Something-Something) would benefit from a brief table comparing annotation granularity and temporal precision.
  2. §4.3 (Evaluation metric): Explicitly define how the two-frame tolerance is implemented (e.g., whether a prediction is credited if any frame in [GT-2, GT+2] is selected, or only the closest frame).
  3. Figure 2 and §3.1: The cross-attention diagram would be clearer if the query/key/value dimensions and the hand-region masking strategy were annotated directly on the figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to clarify our experimental setup and strengthen the empirical claims. We address each major comment below.

Point-by-point responses
  1. Referee: Abstract and §4 (Experiments): The 16.91% AP gain under two-frame tolerance is the central empirical claim. It is unclear whether the event-spotting baselines were re-implemented and trained from scratch on the identical TouchMoment train/val/test splits (4,021 videos) using the same data augmentations, optimizer, and label-resolution rules for occluded or ambiguous contacts. Any mismatch would undermine the delta; please report training protocols and hyper-parameters for all baselines side-by-side.

    Authors: All baselines were re-implemented from scratch and trained on the identical TouchMoment train/val/test splits using the same data augmentations, optimizer, and label-resolution rules for occluded or ambiguous contacts. We will add a side-by-side table of training protocols and hyper-parameters in the revised §4 to make this fully transparent. revision: yes

  2. Referee: §3.2 (Grasp-aware loss): The soft-label formulation is load-bearing. If the grasp-aware term simply re-weights frames within a fixed window around annotated contacts, the improvement could be an artifact of label smoothing rather than learned contact-specific patterns. An ablation that isolates the soft-label component against standard binary cross-entropy (with identical HiCE backbone) is required to substantiate the claim.

    Authors: We agree that isolating the soft-label contribution is necessary. We performed an additional ablation using the identical HiCE backbone that compares the grasp-aware loss with soft labels against standard binary cross-entropy. The soft-label version yields a clear additional gain, indicating the benefit arises from modeling hand-pose dynamics rather than generic smoothing. We will insert this ablation into the revised §3.2 and experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on new dataset with standard supervised training

full rationale

The paper introduces a new dataset (TouchMoment) and a neural architecture (HiCE with cross-attention and grasp-aware loss) whose performance is measured by average precision on held-out video splits. No mathematical derivation chain is claimed; the central result is an empirical delta (16.91% AP) obtained by training and evaluating models on the same annotated data. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The method description is a standard supervised-learning pipeline whose correctness is externally falsifiable via the released dataset and re-implementation of baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of the newly proposed HiCE cross-attention module and grasp-aware loss, which are introduced without independent prior evidence beyond the reported experiments.

free parameters (1)
  • neural network hyperparameters and loss weights
    Standard deep learning training parameters fitted to the TouchMoment data.
invented entities (2)
  • HiCE module (no independent evidence)
    purpose: Leverage spatiotemporal hand context via cross-attention for contact detection
    New architecture component proposed in the paper.
  • grasp-aware loss (no independent evidence)
    purpose: Emphasize hand pose and movement dynamics for distinguishing contact frames
    New loss formulation proposed in the paper.

