HandyLabel: Towards Post-Processing to Real-Time Annotation Using Skeleton Based Hand Gesture Recognition
Pith reviewed 2026-05-17 05:01 UTC · model grok-4.3
The pith
HandyLabel maps hand gestures to labels in real time so users annotate data during recording instead of afterward.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HandyLabel enables real-time annotation by mapping user-defined hand gestures to labels through skeleton-based recognition; ResNet50 on preprocessed skeleton images yields an F1-score of 0.923 on HaGRID, and 88.9 percent of 46 study participants favored the tool over post-processing alternatives.
What carries the argument
A web interface for customizing gesture-to-label mappings combined with real-time skeleton-preprocessed hand gesture recognition models.
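The mapping layer can be pictured with a minimal sketch. It assumes the recognizer emits one gesture name per frame; the class name `RealTimeAnnotator`, the debounce window, the example labels, and the 30 fps rate are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical mapping a user might configure in the web interface.
# Gesture names are HaGRID-style class names; the labels are invented.
GESTURE_TO_LABEL = {"like": "engaged", "dislike": "confused", "palm": "pause"}


@dataclass
class RealTimeAnnotator:
    """Turns a stream of per-frame gesture predictions into timestamped labels."""

    mapping: dict
    min_stable_frames: int = 3  # assumed debounce window, not from the paper
    annotations: list = field(default_factory=list)
    _current: str = ""
    _streak: int = 0
    _emitted: bool = False

    def update(self, timestamp: float, gesture: str) -> None:
        # Count how many consecutive frames show the same gesture.
        if gesture == self._current:
            self._streak += 1
        else:
            self._current, self._streak, self._emitted = gesture, 1, False
        # Emit one label per stable gesture, and only for mapped gestures.
        if (
            not self._emitted
            and self._streak >= self.min_stable_frames
            and gesture in self.mapping
        ):
            self.annotations.append((timestamp, self.mapping[gesture]))
            self._emitted = True


annotator = RealTimeAnnotator(GESTURE_TO_LABEL)
frames = ["none", "like", "like", "like", "like", "none", "palm", "palm", "palm"]
for i, g in enumerate(frames):
    annotator.update(timestamp=i / 30.0, gesture=g)  # assumes a 30 fps stream
```

Requiring a few identical consecutive frames before emitting a label is one simple way to keep a single misclassified frame from producing a spurious annotation; whether HandyLabel does anything similar is not stated in the text above.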
If this is right
- Annotation of large video or sensor streams no longer requires separate post-recording sessions.
- Labelers can assign subjective categories such as emotions without relying on later memory recall.
- Custom gesture sets can be tailored to specific domains like medical or educational data collection.
- Faster iteration cycles become possible for machine-learning projects that depend on fresh labeled data.
Where Pith is reading between the lines
- If the gesture recognizer tolerates partial occlusions or varying camera angles, the same pipeline could support annotation on mobile devices during field recordings.
- Pairing the gesture channel with brief voice confirmations might further lower error rates for complex labeling schemes.
- The skeleton-preprocessing step could be reused in other real-time HCI applications that need low-latency hand input without full video transmission.
Load-bearing premise
Hand gesture recognition stays accurate and low-error when users actually record and label data in typical environments, and the reported user preference generalizes to ordinary annotation work.
What would settle it
Run the same 46-participant labeling task on a new dataset recorded in uncontrolled lighting and backgrounds; if accuracy falls below 0.80 F1 or preference drops below 70 percent, the central claim does not hold.
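The two thresholds above can be written as a small check. This is a sketch under assumptions: the paper's averaging scheme for F1 is not given here, so macro averaging is assumed, and the function names and toy data are invented for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def claim_survives(y_true, y_pred, prefers_tool, f1_floor=0.80, pref_floor=0.70):
    """Apply the two stated thresholds to the results of a replication run."""
    f1 = macro_f1(y_true, y_pred)
    preference = sum(prefers_tool) / len(prefers_tool)
    return f1 >= f1_floor and preference >= pref_floor, f1, preference


# Toy data, invented purely to exercise the functions.
y_true = ["like", "like", "palm", "palm", "fist", "fist"]
y_pred = ["like", "like", "palm", "fist", "fist", "fist"]
votes = [True, True, True, False]
ok, f1, pref = claim_survives(y_true, y_pred, votes)
```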
Original abstract
The success of machine learning is deeply linked to the availability of high-quality training data, yet retrieving and manually labeling new data remains a time-consuming and error-prone process. Traditional annotation tools, such as Label Studio, often require post-processing, where users label data after it has been recorded. Post-processing is highly time-consuming and labor-intensive, especially with large datasets, and may lead to erroneous annotations due to the difficulty of subjects' memory tasks when labeling cognitive activities such as emotions or comprehension levels. In this work, we introduce HandyLabel, a real-time annotation tool that leverages hand gesture recognition to map hand signs for labeling. The application enables users to customize gesture mappings through a web-based interface, allowing for real-time annotations. To ensure the performance of HandyLabel, we evaluate several hand gesture recognition models on an open-source hand sign (HaGRID) dataset, with and without skeleton-based preprocessing. We discovered that ResNet50 with preprocessed skeleton-based images performs an F1-score of 0.923. To validate the usability of HandyLabel, a user study was conducted with 46 participants. The results suggest that 88.9% of participants preferred HandyLabel over traditional annotation tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HandyLabel, a web-based real-time annotation tool that uses hand gesture recognition to let users map custom hand signs to labels during data recording, avoiding post-processing. It evaluates multiple models on the public HaGRID dataset and reports that ResNet50 with skeleton-preprocessed images achieves an F1-score of 0.923. A user study with 46 participants finds that 88.9% prefer HandyLabel over traditional tools such as Label Studio.
Significance. If the central claims hold, the work offers a practical advance for ML data collection by enabling real-time gesture-based labeling, which could reduce time, labor, and memory-related errors especially for cognitive or affective annotations. Credit is due for grounding the evaluation in a public dataset (HaGRID) and for including an independent user-preference study; these provide concrete, falsifiable metrics rather than purely theoretical claims.
major comments (2)
- [Model evaluation] Model evaluation section: The headline claim that HandyLabel supports reliable real-time annotation rests on the reported 0.923 F1 for ResNet50 on skeleton-preprocessed HaGRID images. However, HaGRID contains only fixed, standardized signs; the tool explicitly allows users to define arbitrary custom gesture mappings. No ablation, accuracy, or latency results are given for these custom mappings under live webcam conditions (variable lighting, pose, occlusion, or recording setup). Without this link, the HaGRID metric does not underwrite the real-time annotation performance asserted in the abstract and introduction.
- [User study] User study section: The 88.9% preference result among 46 participants is presented as validation of usability, yet the manuscript supplies no error bars, exclusion criteria, task protocol, or statistical tests. Because the central claim includes improved annotation experience over post-processing tools, the absence of these details leaves the generalizability of the preference finding unverified.
minor comments (2)
- [Abstract] Abstract: The reported F1-score and preference percentage appear without confidence intervals or basic experimental parameters (e.g., number of classes, training/validation split).
- [Methods] Notation and figures: The skeleton preprocessing pipeline is referenced but not illustrated with a diagram or pseudocode; a figure showing the end-to-end real-time flow would improve clarity.
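To make the last minor comment concrete, here is one plausible shape for a skeleton preprocessing step. It is only a sketch: it assumes a 21-point hand skeleton with the joint layout commonly used by MediaPipe Hands, takes already-detected normalized landmarks as input, and draws them onto a blank image. The function name, image size, and synthetic landmarks are hypothetical, and the paper's actual pipeline may differ.

```python
# Edges of a 21-point hand skeleton (wrist = 0, four joints per finger).
# This topology mirrors the layout commonly used by MediaPipe Hands; treating
# it as the paper's layout is an assumption.
HAND_EDGES = [
    (0, 1), (1, 2), (2, 3), (3, 4),          # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),          # index
    (5, 9), (9, 10), (10, 11), (11, 12),     # middle
    (9, 13), (13, 14), (14, 15), (15, 16),   # ring
    (13, 17), (17, 18), (18, 19), (19, 20),  # pinky
    (0, 17),                                 # palm base
]


def render_skeleton(landmarks, size=64):
    """Rasterize normalized (x, y) landmarks into a size x size binary image."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 landmarks")
    image = [[0] * size for _ in range(size)]
    # Map normalized coordinates in [0, 1] to integer pixel positions.
    pixels = [
        (min(size - 1, max(0, round(x * (size - 1)))),
         min(size - 1, max(0, round(y * (size - 1)))))
        for x, y in landmarks
    ]
    for a, b in HAND_EDGES:
        (x0, y0), (x1, y1) = pixels[a], pixels[b]
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        # Simple linear interpolation between the two joints.
        for s in range(steps + 1):
            x = round(x0 + (x1 - x0) * s / steps)
            y = round(y0 + (y1 - y0) * s / steps)
            image[y][x] = 1
    return image


# Synthetic landmarks (not real detector output): a wrist plus five straight fingers.
landmarks = [(0.5, 0.9)]
for finger in range(5):
    for joint in range(1, 5):
        landmarks.append((0.2 + 0.15 * finger, 0.9 - 0.2 * joint))
skeleton_image = render_skeleton(landmarks)
```

An image of this kind discards background, lighting, and skin texture, which is presumably why skeleton preprocessing could help a classifier such as ResNet50; the sketch does not reproduce the paper's rendering choices.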
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical potential of HandyLabel. We respond to each major comment below with clarifications and note planned revisions.
-
Referee: [Model evaluation] Model evaluation section: The headline claim that HandyLabel supports reliable real-time annotation rests on the reported 0.923 F1 for ResNet50 on skeleton-preprocessed HaGRID images. However, HaGRID contains only fixed, standardized signs; the tool explicitly allows users to define arbitrary custom gesture mappings. No ablation, accuracy, or latency results are given for these custom mappings under live webcam conditions (variable lighting, pose, occlusion, or recording setup). Without this link, the HaGRID metric does not underwrite the real-time annotation performance asserted in the abstract and introduction.
Authors: We acknowledge the distinction between standardized signs in HaGRID and the custom mappings supported by the tool. The HaGRID evaluation establishes a reproducible baseline for the skeleton-preprocessed ResNet50 pipeline under controlled conditions, which underpins the recognition component. Custom gestures are user-defined but rely on the same model; we did not include live ablation studies for arbitrary gestures because such tests would require extensive new data collection across variable conditions. In the revision we will add an explicit limitations paragraph clarifying this scope and the assumptions for custom use. revision: partial
-
Referee: [User study] User study section: The 88.9% preference result among 46 participants is presented as validation of usability, yet the manuscript supplies no error bars, exclusion criteria, task protocol, or statistical tests. Because the central claim includes improved annotation experience over post-processing tools, the absence of these details leaves the generalizability of the preference finding unverified.
Authors: We agree that additional methodological details are needed to support the usability claim. The revised manuscript will report error bars on the preference rate, describe the task protocol and exclusion criteria, and include statistical analysis of the preference results. revision: yes
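As an illustration of what an error bar on the preference rate could look like, the sketch below computes a Wilson score interval. The counts are an assumption: the text above gives only 88.9% and 46 participants, and no whole number out of 46 rounds to 88.9% (40/46 is 87.0%, 41/46 is 89.1%), whereas 40 of 45 does, so 40 of 45 is used here purely as a plausible reading.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 40 of 45 reproduces 88.9%; the actual counts are not given above.
low, high = wilson_interval(40, 45)
```

With a sample of this size the interval is fairly wide, which is the practical reason the referee's request for error bars matters.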
Circularity Check
No circularity: results rest on external dataset and independent user study
The paper reports an empirical F1-score of 0.923 for ResNet50 on skeleton-preprocessed HaGRID images and a user-study preference of 88.9% from 46 participants. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The performance numbers are direct measurements against a public external dataset and participant responses; they do not reduce by construction to quantities defined or fitted inside the paper. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Hand-gesture recognition models trained on HaGRID generalize sufficiently to real-time annotation tasks.
- domain assumption: User preference in a 46-person study predicts real-world adoption and labeling quality.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "ResNet50 with preprocessed skeleton-based images performs an F1-score of 0.923... 88.9% of participants preferred HandyLabel"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "skeleton-based preprocessing... MediaPipe... real-time annotation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.