HandyLabel: Towards Post-Processing to Real-Time Annotation Using Skeleton Based Hand Gesture Recognition

Andreas Dengel; Brian Moser; Ko Watanabe; Sachin Kumar Singh; Shoya Ishimaru

arxiv: 2511.22337 · v2 · submitted 2025-11-27 · 💻 cs.HC

Sachin Kumar Singh, Ko Watanabe, Brian Moser, Shoya Ishimaru, Andreas Dengel

Pith reviewed 2026-05-17 05:01 UTC · model grok-4.3

classification 💻 cs.HC

keywords hand gesture recognition · real-time annotation · data labeling · skeleton preprocessing · ResNet50 · user study · HaGRID dataset · machine learning training data

The pith

HandyLabel maps hand gestures to labels in real time so users annotate data during recording instead of afterward.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HandyLabel as a web tool that lets users define custom hand signs and then labels incoming data on the fly through gesture recognition. Traditional post-recording tools force labelers to rely on memory for subjective categories such as emotion or comprehension, which the authors argue increases errors and time. By preprocessing hand images into skeletons and feeding them to ResNet50, the system reaches an F1-score of 0.923 on the HaGRID dataset. A study with 46 participants found that 88.9 percent preferred HandyLabel to conventional annotation interfaces.

Core claim

HandyLabel enables real-time annotation by mapping user-defined hand gestures to labels through skeleton-based recognition; ResNet50 on preprocessed skeleton images yields an F1-score of 0.923 on HaGRID, and 88.9 percent of 46 study participants favored the tool over post-processing alternatives.

What carries the argument

A web interface for customizing gesture-to-label mappings combined with real-time skeleton-preprocessed hand gesture recognition models.
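
As a rough illustration of how a gesture-to-label mapping could turn a stream of per-frame recognizer outputs into timestamped annotations, here is a minimal sketch. It is not the paper's implementation: the function name `frames_to_annotations`, the `min_frames` noise filter, and the example labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    label: str
    start: float     # seconds since recording start
    duration: float  # seconds between first and last frame of the run


def frames_to_annotations(frames, gesture_to_label, min_frames=3):
    """Collapse per-frame gesture predictions into labelled segments.

    frames: iterable of (timestamp_seconds, gesture_name_or_None)
    gesture_to_label: user-defined one-to-one mapping (hypothetical format)
    min_frames: shorter runs are treated as recognizer noise (an assumption)
    """
    annotations = []
    run_label, run_start, run_last, run_len = None, 0.0, 0.0, 0

    def flush():
        # Keep the current run only if it is a mapped label and long enough.
        if run_label is not None and run_len >= min_frames:
            annotations.append(
                Annotation(run_label, run_start, run_last - run_start)
            )

    for ts, gesture in frames:
        label = gesture_to_label.get(gesture)  # unmapped gestures become None
        if label == run_label:
            run_last, run_len = ts, run_len + 1
        else:
            flush()
            run_label, run_start, run_last, run_len = label, ts, ts, 1
    flush()
    return annotations


# Hypothetical mapping and a short synthetic prediction stream.
mapping = {"like": "understood", "dislike": "confused"}
stream = [
    (0.0, None), (0.1, "like"), (0.2, "like"), (0.3, "like"), (0.4, None),
    (0.5, "dislike"), (0.6, None),
    (0.7, "dislike"), (0.8, "dislike"), (0.9, "dislike"),
]
demo = frames_to_annotations(stream, mapping)
```

In this synthetic stream the single-frame "dislike" at 0.5 s is discarded as noise, while the two three-frame runs become annotations. A real deployment would need to tune the noise threshold against the recognizer's actual frame rate and error pattern.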

If this is right

Annotation of large video or sensor streams no longer requires separate post-recording sessions.
Labelers can assign subjective categories such as emotions without relying on later memory recall.
Custom gesture sets can be tailored to specific domains like medical or educational data collection.
Faster iteration cycles become possible for machine-learning projects that depend on fresh labeled data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the gesture recognizer tolerates partial occlusions or varying camera angles, the same pipeline could support annotation on mobile devices during field recordings.
Pairing the gesture channel with brief voice confirmations might further lower error rates for complex labeling schemes.
The skeleton-preprocessing step could be reused in other real-time HCI applications that need low-latency hand input without full video transmission.

Load-bearing premise

Hand gesture recognition stays accurate and low-error when users actually record and label data in typical environments, and the reported user preference generalizes to ordinary annotation work.

What would settle it

Run the same 46-participant labeling task on a new dataset recorded in uncontrolled lighting and backgrounds; if accuracy falls below 0.80 F1 or preference drops below 70 percent, the central claim does not hold.
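
The check described above can be sketched in a few lines. This is an illustration only: the text does not say which F1 averaging the paper uses, so macro-averaging is an assumption here, and `macro_f1` and `claim_survives` are hypothetical helper names.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def claim_survives(f1, preference, f1_floor=0.80, preference_floor=0.70):
    """Apply the two thresholds named in the text above."""
    return f1 >= f1_floor and preference >= preference_floor


# Tiny synthetic example, not data from the paper.
example_f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Passing the reported figures, `claim_survives(0.923, 0.889)` evaluates to `True`; the point of the proposed replication is to see whether that still holds with numbers measured under uncontrolled conditions.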

Figures

Figures reproduced from arXiv: 2511.22337 by Andreas Dengel, Brian Moser, Ko Watanabe, Sachin Kumar Singh, Shoya Ishimaru.

**Figure 1.** Overview of the HandyLabel: Existing annotation tools perform through post-processing, where the user tries to make an annotation by recalling the target label (i.e., emotions). However, recalling might cause false memory and wrong annotation. Instead, our proposed application allows participants to make annotations during the experiment, enabling real-time precision.

**Figure 2.** Selected hand gestures (a) and preprocessing configurations for gesture…

**Figure 3.** System overview. The workflow begins with real-time hand gesture…

**Figure 4.** HandyLabel workflow: (a) Users select a one-to-one relationship between the annotation label and the hand sign. (b) Camera recording starts and hand sign recognition runs in the backend; when the model recognizes one of the selected hand gestures, the timestamp and the label are stored as a log. (c) Lastly, after the user stops recording, the dashboard shows the datetime/timestamp and the duration of each annot…

**Figure 5.** User preferences for HandyLabel compared to Label Studio. The first pie chart illustrates that the vast majority of users found HandyLabel more intuitive than Label Studio, with only a small percentage favoring Label Studio. The second chart reflects user preferences for data annotation, with 88.9% of participants indicating they would rather use HandyLabel for annotation tasks. The third chart shows that …

**Figure 6.** Comparison of user ratings for setup effort between…

Original abstract

The success of machine learning is deeply linked to the availability of high-quality training data, yet retrieving and manually labeling new data remains a time-consuming and error-prone process. Traditional annotation tools, such as Label Studio, often require post-processing, where users label data after it has been recorded. Post-processing is highly time-consuming and labor-intensive, especially with large datasets, and may lead to erroneous annotations due to the difficulty of subjects' memory tasks when labeling cognitive activities such as emotions or comprehension levels. In this work, we introduce HandyLabel, a real-time annotation tool that leverages hand gesture recognition to map hand signs for labeling. The application enables users to customize gesture mappings through a web-based interface, allowing for real-time annotations. To ensure the performance of HandyLabel, we evaluate several hand gesture recognition models on an open-source hand sign (HaGRID) dataset, with and without skeleton-based preprocessing. We discovered that ResNet50 with preprocessed skeleton-based images performs an F1-score of 0.923. To validate the usability of HandyLabel, a user study was conducted with 46 participants. The results suggest that 88.9% of participants preferred HandyLabel over traditional annotation tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

HandyLabel is a straightforward application of existing skeleton-based gesture recognition to real-time annotation, with concrete F1 and preference numbers, but the results on fixed HaGRID signs do not yet underwrite performance for user-defined gestures in live webcam conditions.

The main thing here is a web tool that lets users define their own hand gestures for labeling data while it is being recorded, skipping the usual post-processing step. They evaluate several models on the HaGRID dataset and report that ResNet50 with skeleton preprocessing reaches 0.923 F1, then show that 88.9 percent of 46 participants preferred the tool over standard annotation software like Label Studio. That preference result is the most direct evidence they offer for usability in practice, especially for labeling things like emotions or comprehension where recall after the fact is unreliable.

The integration itself is new in the sense that it combines customizable mapping with real-time feedback in one interface, and the model comparison with and without skeleton preprocessing is a clear, reproducible step. The user study adds a practical check that goes beyond pure accuracy numbers.

The soft spot is the gap between the reported dataset metric and the live custom-gesture claim. HaGRID contains standardized signs under controlled conditions, but the tool is built for arbitrary user mappings, and the abstract and stress-test note give no accuracy, latency, or robustness numbers for those mappings under typical webcam lighting, pose variation, or occlusion. Without that link or an ablation on custom gestures, the 0.923 F1 does not fully support the real-time annotation promise.

This paper is for HCI and ML practitioners who build video or activity datasets and want to cut labeling time. A reader working on annotation workflows or gesture interfaces would find the implementation and preference data useful. It deserves peer review because the core idea is testable and the empirical pieces are concrete enough for referees to give targeted feedback on the evaluation gaps.

Referee Report

2 major / 2 minor

Summary. The paper introduces HandyLabel, a web-based real-time annotation tool that uses hand gesture recognition to let users map custom hand signs to labels during data recording, avoiding post-processing. It evaluates multiple models on the public HaGRID dataset and reports that ResNet50 with skeleton-preprocessed images achieves an F1-score of 0.923. A user study with 46 participants finds that 88.9% prefer HandyLabel over traditional tools such as Label Studio.

Significance. If the central claims hold, the work offers a practical advance for ML data collection by enabling real-time gesture-based labeling, which could reduce time, labor, and memory-related errors especially for cognitive or affective annotations. Credit is due for grounding the evaluation in a public dataset (HaGRID) and for including an independent user-preference study; these provide concrete, falsifiable metrics rather than purely theoretical claims.

major comments (2)

[Model evaluation] Model evaluation section: The headline claim that HandyLabel supports reliable real-time annotation rests on the reported 0.923 F1 for ResNet50 on skeleton-preprocessed HaGRID images. However, HaGRID contains only fixed, standardized signs; the tool explicitly allows users to define arbitrary custom gesture mappings. No ablation, accuracy, or latency results are given for these custom mappings under live webcam conditions (variable lighting, pose, occlusion, or recording setup). Without this link, the HaGRID metric does not underwrite the real-time annotation performance asserted in the abstract and introduction.
[User study] User study section: The 88.9% preference result among 46 participants is presented as validation of usability, yet the manuscript supplies no error bars, exclusion criteria, task protocol, or statistical tests. Because the central claim includes improved annotation experience over post-processing tools, the absence of these details leaves the generalizability of the preference finding unverified.
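
To give a sense of what error bars on the preference rate might look like, the sketch below computes a 95% Wilson score interval. It assumes all 46 participants answered the preference question, which the text does not confirm; 88.9% of 46 is not a whole number of people, so the true denominator may differ. The function name is hypothetical.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    p_hat: observed proportion, n: sample size, z: normal quantile
    (1.96 corresponds to roughly 95% coverage).
    """
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width


# Reported preference rate with the reported participant count
# (assumed denominator, see the note above).
low, high = wilson_interval(0.889, 46)
```

Under these assumptions the interval spans roughly 0.77 to 0.95, so the lower end still sits above an even split, though with a sample this small the uncertainty is substantial.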

minor comments (2)

[Abstract] Abstract: The reported F1-score and preference percentage appear without confidence intervals or basic experimental parameters (e.g., number of classes, training/validation split).
[Methods] Notation and figures: Skeleton preprocessing pipeline is referenced but not illustrated with a diagram or pseudocode; a figure showing the end-to-end real-time flow would improve clarity.
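
For readers who want a concrete picture of what skeleton preprocessing could involve, a minimal sketch follows. It is not the paper's pipeline, which this page does not specify. It assumes an upstream hand-landmark detector has already produced 21 normalized (x, y) points in the MediaPipe Hands index convention (0 = wrist, then four joints per finger), and it rasterizes them onto a blank canvas in pure Python. The canvas size and function names are illustrative choices.

```python
# MediaPipe Hands index convention: 0 = wrist, then four joints per finger.
FINGERS = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12),
           (13, 14, 15, 16), (17, 18, 19, 20)]
PALM = [(0, 1), (0, 5), (5, 9), (9, 13), (13, 17), (0, 17)]


def hand_edges():
    """Return the bone list: palm outline plus three segments per finger."""
    edges = list(PALM)
    for finger in FINGERS:
        edges += list(zip(finger, finger[1:]))
    return edges


def render_skeleton(landmarks, size=64):
    """Draw a hand skeleton on a size x size grid of 0/1 values.

    landmarks: 21 (x, y) pairs normalized to [0, 1]; the image background
    is discarded, leaving only the hand's structure.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    canvas = [[0] * size for _ in range(size)]

    def to_px(v):
        # Clamp so slightly out-of-range detections stay on the canvas.
        return min(size - 1, max(0, int(round(v * (size - 1)))))

    for a, b in hand_edges():
        x0, y0 = map(to_px, landmarks[a])
        x1, y1 = map(to_px, landmarks[b])
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):  # simple linear interpolation per bone
            x = x0 + round((x1 - x0) * i / steps)
            y = y0 + round((y1 - y0) * i / steps)
            canvas[y][x] = 1
    return canvas


# Synthetic open-hand pose, not real detector output.
pose = [(0.5, 0.9)] + [
    (0.1 + 0.2 * f, 0.7 - 0.15 * j) for f in range(5) for j in range(1, 5)
]
skeleton = render_skeleton(pose)
```

The resulting grid could then be resized and fed to an image classifier such as ResNet50; how the paper actually draws, colors, or crops its skeleton images is not stated here.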

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical potential of HandyLabel. We respond to each major comment below with clarifications and note planned revisions.

Referee: [Model evaluation] Model evaluation section: The headline claim that HandyLabel supports reliable real-time annotation rests on the reported 0.923 F1 for ResNet50 on skeleton-preprocessed HaGRID images. However, HaGRID contains only fixed, standardized signs; the tool explicitly allows users to define arbitrary custom gesture mappings. No ablation, accuracy, or latency results are given for these custom mappings under live webcam conditions (variable lighting, pose, occlusion, or recording setup). Without this link, the HaGRID metric does not underwrite the real-time annotation performance asserted in the abstract and introduction.

Authors: We acknowledge the distinction between standardized signs in HaGRID and the custom mappings supported by the tool. The HaGRID evaluation establishes a reproducible baseline for the skeleton-preprocessed ResNet50 pipeline under controlled conditions, which underpins the recognition component. Custom gestures are user-defined but rely on the same model; we did not include live ablation studies for arbitrary gestures because such tests would require extensive new data collection across variable conditions. In the revision we will add an explicit limitations paragraph clarifying this scope and the assumptions for custom use.

revision: partial

Referee: [User study] User study section: The 88.9% preference result among 46 participants is presented as validation of usability, yet the manuscript supplies no error bars, exclusion criteria, task protocol, or statistical tests. Because the central claim includes improved annotation experience over post-processing tools, the absence of these details leaves the generalizability of the preference finding unverified.

Authors: We agree that additional methodological details are needed to support the usability claim. The revised manuscript will report error bars on the preference rate, describe the task protocol and exclusion criteria, and include statistical analysis of the preference results.

revision: yes

Circularity Check

0 steps flagged

No circularity: results rest on external dataset and independent user study

The paper reports an empirical F1-score of 0.923 for ResNet50 on skeleton-preprocessed HaGRID images and a user-study preference of 88.9% from 46 participants. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The performance numbers are direct measurements against a public external dataset and participant responses; they do not reduce by construction to quantities defined or fitted inside the paper. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard computer-vision assumptions about gesture recognizability and on the validity of a small-scale user preference study; no new physical entities or ad-hoc constants are introduced.

axioms (2)

domain assumption: Hand-gesture recognition models trained on HaGRID generalize sufficiently to real-time annotation tasks.
Invoked when claiming the 0.923 F1-score supports practical use of HandyLabel.

domain assumption: User preference in a 46-person study predicts real-world adoption and labeling quality.
Invoked when stating that 88.9% preference validates the tool.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: ResNet50 with preprocessed skeleton-based images performs an F1-score of 0.923... 88.9% of participants preferred HandyLabel

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: skeleton-based preprocessing... MediaPipe... real-time annotation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] Maali Alabdulhafith, Abdulhadi Alqarni, and Srinivas Sampalli. Customized communication between healthcare members during the medication administration stage. In Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, MobileHCI ’18, New York, NY, USA, 2018. Association for Computing Machinery. IS... doi: 10.1145/3229434

[2] Athary Alwasel, Masoud Fakhimi, Navonil Mustafee, and Lampros Stergioulas. Modeling and simulation for behavioral analysis in healthcare: A review. ACM Trans. Model. Comput. Simul., June 2025. ISSN 1049-3301. doi: 10.1145/3742428. URL https://doi.org/10.1145/3742428. Just Accepted.

[3] Holger Caesar, Varun Bankiti, Alex Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11621–11631, 2020.

[4] Jessica R. Cauchard, Julien Epps, Jorge Goncalves, Jonna Häkkilä, Viviane Herdel, and Monica Perusquia-Hernandez. Affective computing for mobile technologies. In Adjunct Proceedings of the 26th International Conference on Mobile Human-Computer Interaction, MobileHCI ’24 Adjunct, New York, NY, USA, 2024. Association for Computing Machinery. ISBN... doi: 10.1145/3640471.3680459

[5] Nicholas Davis, Rafael A. Calvo, and Mark d’Inverno. Creative ai: Inspiring human creativity through generative design. In Proceedings of the 2019 ACM SIGCHI Conference on Creativity and Cognition, pages 185–195, 2019. doi: 10.1145/3325480.3326578

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT, 2019.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. H...

[8] Teodor Fredriksson, David Issa Mattos, Jan Bosch, and Helena Holmström Olsson. Data labeling: An empirical investigation into industrial challenges and mitigation strategies. In International Conference on Product-Focused Software Process Improvement, pages 202–216. Springer, 2020.

[9] Google. Mediapipe: A framework for building perception pipelines. https://mediapipe.dev/, 2019.

[10] Michal K. Grzeszczyk, Anna Lisowska, Arkadiusz Sitek, and Aneta Lisowska. Decoding emotional valence from wearables: Can our data reveal our true feelings? In Proceedings of the 25th International Conference on Mobile Human-Computer Interaction, MobileHCI ’23 Companion, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450399241. doi: 10.1145/3565066.3608698. ...

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. ResNet-18.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. ResNet-50.

[13] Nourdine Herbaz, Hassan El Idrissi, and Abdelmajid Badri. Deep learning empowered hand gesture recognition: using yolo techniques. In 2023 14th International Conference on Intelligent Systems: Theories and Applications (SITA), pages 1–7, 2023. doi: 10.1109/SITA60746.2023.10373734

[14] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Mingxing Tan, Bo Wang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1314–1324, 2019.

[15] Intel. Cvat: Computer vision annotation tool. https://github.com/openvinotoolkit/cvat, 2021.

[16] Alexander Kapitanov, Karina Kvanchiani, Alexander Nagaev, Roman Kraynov, and Andrei Makhliarchuk. Hagrid – hand gesture recognition image dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4572–4581, January 2024.

[17] Vangjel Kazllarof, Stamatis Karlos, Angeliki-Panagiota Panagopoulou, and Sotiris Kotsiantis. Automated hand gesture recognition for educational applications. In Proceedings of the 20th Pan-Hellenic Conference on Informatics, PCI ’16, New York, NY, USA, 2016. Association for ... doi: 10.1145/3003733

[18] Unseok Lee and Jiro Tanaka. Finger identification and hand gesture recognition techniques for natural user interface. In Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction, APCHI ’13, page 274–279, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450322539. doi: 10.1145/2525194.2525296. URL https://doi.org/1...

[19] Yifan Li, Yukun Wen, Shibin Qiu, and Anfeng Hao. Deep learning based hand gesture recognition in virtual reality applications. IEEE Access, 7:131019–131029, 2019.

[20] Kim Chwee Lim, Swee Heng Sin, Chien Wei Lee, Weng Khin Chin, Junliang Lin, Khang Nguyen, Quang H. Nguyen, Binh P. Nguyen, and Matthew Chua. Video-based skeletal feature extraction for hand gesture recognition. In Proceedings of the 4th International Conference on Machine Learning and Soft Computing, ICMLSC ’20, page 108–112, New York, NY, USA, 2020. Associ... doi: 10.1145/3380688.3380711

[21] Pavlo Molchanov, Shalini Gupta, Kihwan Kim, and Jan Kautz. Hand gesture recognition with 3d convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1–7, 2015. doi: 10.1109/CVPRW.2015.7301342

[22] Thi Thanh Mai Nguyen, Ngoc Hai Pham, Van Thai Dong, Viet Son Nguyen, and Thi Thanh Hai Tran. A fully automatic hand gesture recognition system for human-robot interaction. In Proceedings of the 2nd Symposium on Information and Communication Technology, SoICT ’11, page 112–119, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450308809... doi: 10.1145/2069216

[23] Sangjun O., Rammohan Mallipeddi, and Minho Lee. Real time hand gesture recognition using random forest and linear discriminant analysis. In Proceedings of the 3rd International Conference on Human-Agent Interaction, HAI ’15, page 279–282, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450335270. doi: 10.1145/2814940.2814997. URL ...

[24] OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2025-08-08.

[25] Christoffer Bøgelund Rasmussen, Kristian Kirk, and Thomas B Moeslund. The challenge of data annotation in deep learning—a case study on whole plant corn silage. Sensors, 22(4):1596, 2022.

[26] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.

[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[28] Rubin Bose S. and Sathiesh Kumar V. Hand gesture recognition using faster r-cnn inception v2 model. In Proceedings of the 2019 4th International Conference on Advances in Robotics, AIR ’19, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450366502. doi: 10.1145/3352593.3352613. URL https://doi.org/10.1145/3352593.3352613

[29] Sandeep Reddy Sabbella, Sara Kaszuba, Francesco Leotta, Pascal Serrarens, and Daniele Nardi. Evaluating gesture recognition in virtual reality. arXiv preprint arXiv:2401.04545, 2024.

[30] Vaidehi Sharma, Abhishek Sharma, and Sandeep Saini. Real-time attention-based embedded lstm for dynamic sign language recognition on edge devices. Journal of Real-Time Image Processing, 21(2):53, 2024.

[31] John Sloan, Daniel Maguire, and Julie Carson-Berndsen. Emotional response language education for mobile devices. In 22nd International Conference on Human-Computer Interaction with Mobile Devices and Services, MobileHCI ’20, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380522. doi: 10.1145/3406324.3417603. URL https://doi.org/1...

[32] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception datasets for autonomous driving. Proceedings of the IEEE, 108(7):1214–1243, 2020.

[34] Hao Tang, Wei Wang, Dan Xu, Yan Yan, and Nicu Sebe. Gesturegan for hand gesture-to-gesture translation in the wild. In Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, page 774–782, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450356657. doi: 10.1145/3240508.3240704. URL https://doi.org/10.1145/3240508.32...

[35] Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. Label Studio: Data labeling software, 2020-2022. URL https://github.com/heartexlabs/label-studio. Open source software available from https://github.com/heartexlabs/label-studio

[36] Aurelijus Vaitkevičius, Mantas Taroza, Tomas Blažauskas, Robertas Damaševičius, Rytis Maskeliūnas, and Marcin Woźniak. Recognition of american sign language gestures in a virtual reality using leap motion. Applied Sciences, 9(3):445, 2019.

[37] Ko Watanabe, Yusuke Soneda, Yuki Matsuda, Yugo Nakamura, Yutaka Arakawa, Andreas Dengel, and Shoya Ishimaru. Discaas: Micro behavior analysis on discussion by camera as a sensor. Sensors, 21(17), 2021. ISSN 1424-8220. doi: 10.3390/s21175719. URL https://www.mdpi.com/1424-8220/21/17/5719

[38] Ko Watanabe, Tanuja Sathyanarayana, Andreas Dengel, and Shoya Ishimaru. Engauge: Engagement gauge of meeting participants estimated by facial expression and deep neural network. IEEE Access, 11:52886–52898, 2023.

[39] Ko Watanabe, Andreas Dengel, and Shoya Ishimaru. Metacognition-engauge: Real-time augmentation of self-and-group engagement levels understanding by gauge interface in online meetings. In Proceedings of the Augmented Humans International Conference 2024, AHs ’24, page 301–303, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400709807... doi: 10.1145/3652920.3653054

[40] Yaseen, Oh-Jin Kwon, Jaeho Kim, Sonain Jamil, Jinhee Lee, and Faiz Ullah. Next-gen dynamic hand gesture recognition: Mediapipe, inception-v3 and lstm-based enhanced deep learning model. Electronics, 13(16):3233, 2024.

[41] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. International Journal of Computer Vision, 128(2):261–318, 2018.