Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

Atmika Bhardwaj; Nico Steckhan; Silvia Vock

arxiv: 2606.01896 · v1 · pith:BY46YY3Dnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-28 15:48 UTC · model grok-4.3

keywords: hand detection, synthetic data, generative inpainting, training schedule, out-of-distribution, YOLOv8, gloved hands, occupational safety

The pith

A two-stage training schedule on real and synthetic hand images followed by real-only fine-tuning raises mAP@0.5 on standard tests and narrows the gap on gloved hands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Public hand datasets mostly show bare hands, so detectors struggle with gloves and other accessories common in safety settings. The authors build a paired dataset of real images and inpainted synthetic counterparts, then train YOLOv8n detectors under six training schedules and compare them with paired statistical tests. The two-stage schedule, which trains on the mixed set and then fine-tunes on real data at a lower learning rate, beats the real-only baseline on mAP@0.5 on the standard real test set and reduces the performance drop on real-gloves data. A three-stage schedule performs best on the stricter mAP@0.5:0.95 metric. The paper concludes that the value of synthetic accessory data depends on the training schedule used.

Core claim

On a paired dataset of real images and their synthetic counterparts, a two-stage experiment (training on the union of real and synthetic data, then fine-tuning the resulting weights on real-only data at a lower learning rate) increases mAP@0.5 over the real-only baseline on the standard real test set and improves the real-gloves out-of-distribution gap. A separate three-stage experiment preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study.
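The two-stage schedule described above can be sketched as an ordered plan of stages, each naming its training data and learning rate. This is a minimal illustration, not the authors' code: the names `Stage`, `two_stage_schedule`, `run_schedule`, and `fake_train`, as well as the learning-rate values, are assumptions made for this sketch (the source only says the second stage uses a lower learning rate).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    datasets: tuple   # which data pools this stage trains on
    lr: float         # hypothetical learning rate, not taken from the paper


def two_stage_schedule(base_lr: float = 0.01, finetune_factor: float = 0.1) -> List[Stage]:
    """Stage 1 trains on real + synthetic; stage 2 fine-tunes on real only at a lower LR."""
    return [
        Stage("mixed_pretrain", ("real", "synthetic"), base_lr),
        Stage("real_finetune", ("real",), base_lr * finetune_factor),
    ]


def run_schedule(stages: List[Stage],
                 train_fn: Callable[[Optional[str], Stage], str]) -> Optional[str]:
    """Run stages in order, handing each stage the weights produced by the previous one."""
    weights: Optional[str] = None  # None means "start from the initial checkpoint"
    for stage in stages:
        weights = train_fn(weights, stage)
    return weights


def fake_train(weights: Optional[str], stage: Stage) -> str:
    # Stand-in for a real detector training call; it only records the weight lineage.
    return f"{weights or 'init'}->{stage.name}"


final_weights = run_schedule(two_stage_schedule(), fake_train)
print(final_weights)  # init->mixed_pretrain->real_finetune
```

In practice `fake_train` would be replaced by a call into whatever detector training routine is in use. The point of the structure is that the second stage starts from the first stage's weights rather than from scratch.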

What carries the argument

The two-stage and three-stage training-and-scheduling regimes that combine real images with generative inpainted accessory variations before evaluation on real and real-gloves test splits.

If this is right

The synthetic-data utility for safety-critical hand detection is determined by the training procedure.
Simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.
The two-stage schedule improves both in-distribution mAP@0.5 and the out-of-distribution gap on gloved hands.
The three-stage schedule yields the best bounding-box tightness as measured by mAP@0.5:0.95.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same scheduling pattern could be tested with other generative methods such as diffusion-based editing to see whether the benefit is specific to inpainting.
Collecting fewer real gloved examples might become feasible if the two-stage schedule reliably transfers across different camera setups.
Extending the paired construction to tattoos or jewelry would test whether the approach scales to other sources of appearance variation.
Repeating the six schedules with a different detector architecture would show whether the schedule effect is architecture-dependent.

Load-bearing premise

The generative inpainting step produces accessory variations whose visual statistics match those of real gloved hands closely enough that the detector learns transferable features rather than inpainting-specific artifacts.

What would settle it

The claim would be falsified if a detector trained with the two-stage schedule on the paired data showed no mAP@0.5 improvement over the real-only baseline on an independently collected real-gloves test set.

Figures

Figures reproduced from arXiv: 2606.01896 by Atmika Bhardwaj, Nico Steckhan, Silvia Vock.

**Figure 2.** An illustrative workflow of this work.

**Figure 3.** Per-image PSNR, SSIM, and LPIPS for full image (gray; … (caption truncated in source)

**Figure 4.** Five-metric radar comparison of the trained detectors on (left) the real … (caption truncated in source)

Original abstract

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment: train on real ∪ synthetic data, then fine-tune the resulting weights on real-only at a lower learning rate, increases mAP@0.5 compared to the real-only baseline model on the standard real test set, and improves the real-gloves out-of-distribution gap. Another three-stage experiment preserves box-tightness best, reaching the highest mAP@0.5:0.95 of any other experiment in the study. The synthetic-data utility for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.
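The two metrics in the abstract differ in how tightly a predicted box must overlap the ground truth. mAP@0.5 accepts a detection at an intersection-over-union (IoU) of at least 0.5, while mAP@0.5:0.95 follows the COCO convention of averaging over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below is not a full mAP implementation; it only computes IoU for one box pair and lists which thresholds that pair would pass. The box coordinates and the helper name `thresholds_passed` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the averaged metric.
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Return the IoU thresholds at which this prediction would count as a match."""
    value = iou(pred, truth)
    return [t for t in THRESHOLDS if value >= t]


truth = (0, 0, 100, 100)       # hypothetical ground-truth hand box
loose_pred = (0, 0, 100, 62)   # covers 62% of the ground truth, so IoU = 0.62

print(iou(loose_pred, truth))
print(thresholds_passed(loose_pred, truth))
```

A loose box like this one counts fully toward mAP@0.5 but fails most of the stricter thresholds, which is why mAP@0.5:0.95 is read here as a measure of box tightness.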

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Two-stage training with inpainted glove data lifts mAP on real-gloved hands, but the result rests on an untested assumption that the synthetics match real accessory statistics.

The main finding is that pre-training on real plus inpainted synthetic hands, followed by real-only fine-tuning at a lower learning rate, raises mAP@0.5 on the standard test set and narrows the gap on the real-gloves split compared with a real-only baseline. A three-stage variant reaches the highest mAP@0.5:0.95 of the six schedules. The authors run six schedules, three seeds each, and report paired statistical tests.

The paper does a clean job of isolating the effect of training schedule on a documented distribution shift. Hand detection for safety needs gloves and other accessories represented; public datasets mostly lack them. Using inpainting to edit only the hand region is a reasonable way to create paired data, and systematically varying the schedule is a legitimate next step after earlier augmentation work.
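Because the inpainting is meant to edit only the hand region, one simple sanity check on a real/synthetic pair is to measure the pixel difference separately inside and outside the hand mask. The sketch below does this with PSNR on toy grayscale images stored as nested lists. It is an illustration under stated assumptions rather than the paper's procedure: the function name `masked_psnr`, the mask format, and the pixel values are all hypothetical.

```python
import math


def masked_psnr(real, synth, mask, use_inside=True, max_value=255.0):
    """PSNR between two grayscale images, restricted to pixels inside (or outside) a mask.

    real, synth: 2D lists of pixel intensities; mask: 2D list of booleans (True = hand region).
    Returns math.inf when the selected pixels are identical.
    """
    squared_errors = [
        (r - s) ** 2
        for real_row, synth_row, mask_row in zip(real, synth, mask)
        for r, s, m in zip(real_row, synth_row, mask_row)
        if m == use_inside
    ]
    if not squared_errors:
        raise ValueError("mask selects no pixels")
    mse = sum(squared_errors) / len(squared_errors)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x3 example (hypothetical values): only the masked pixels were edited.
real = [[10, 20, 30], [40, 50, 60]]
synth = [[10, 20, 30], [40, 80, 90]]
mask = [[False, False, False], [False, True, True]]

print(masked_psnr(real, synth, mask, use_inside=False))  # inf: background untouched
print(masked_psnr(real, synth, mask, use_inside=True))   # finite: hand region changed
```

A check like this only confirms that the edit stayed inside the mask. It says nothing about whether the inpainted gloves resemble real ones.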

The soft spot is the missing check on whether the inpainted gloves actually look like real ones. The abstract mentions no FID, LPIPS, or perceptual comparison between inpainted and real gloves, so it remains possible that the first-stage model learns diffusion artifacts and the later fine-tune simply regularizes them away on real data. That would explain the in-distribution gain without guaranteeing true OOD transfer to authentic gloves. The study is also narrow: one class, one generator, and no broader ablation on artifact controls.

This is useful for groups already running inpainting pipelines for PPE or similar detection tasks. It does not introduce new methods or theory, but the empirical comparison of schedules is reproducible enough to be worth referee time. I would send it for review with a request for the missing synthetic-data validation.

Referee Report

3 major / 2 minor

Summary. The paper evaluates whether generative inpainting of accessories (gloves, jewelry) onto real hand images can mitigate distribution shift in hand detection for safety applications. Using a paired real/synthetic dataset, it trains YOLOv8n detectors under six schedules (Experiments A–F, three seeds each), reporting mAP@0.5 and mAP@0.5:0.95 on a standard real test set and a real-gloves OOD split, plus paired statistical tests. The central finding is that a two-stage schedule (pre-train on real ∪ synthetic, then real-only fine-tune at reduced LR) raises mAP@0.5 over the real-only baseline and narrows the OOD gap; a three-stage schedule yields the highest mAP@0.5:0.95.

Significance. If the empirical results are robust, the work shows that synthetic-data utility for hand detection is schedule-dependent rather than automatic, and that simple multi-stage fine-tuning can extract measurable in-distribution and OOD gains from inpainted accessory data. The design (multiple seeds, two mAP thresholds, paired tests) provides a reproducible template for schedule-sensitive evaluation of generative augmentation. This is relevant for safety-critical CV where real accessory data is scarce.

major comments (3)

[Abstract / paired dataset construction] Abstract and paired-dataset construction paragraph: no quantitative validation (FID, LPIPS, frequency analysis, or perceptual study) is reported comparing inpainted glove statistics to real-glove crops. This assumption is load-bearing for the claim that the two-stage schedule improves transferable features rather than allowing exploitation of diffusion artifacts; without it, the OOD-gap reduction could be an artifact of the first-stage regularizer.
[Experiments A-F] Experiments section (schedules A–F): the paper does not state whether the six schedules and the two- and three-stage variants were pre-specified before seeing results or selected after observing performance. Given that the headline claim rests on the superiority of specific multi-stage regimes, post-hoc selection would undermine the statistical tests reported across seeds.
[Results / OOD evaluation] Results on real-gloves OOD split: while paired tests are mentioned, the manuscript does not report effect sizes, confidence intervals on the mAP difference, or an ablation replacing inpainted data with noise- or artifact-matched controls. These are needed to establish that the observed gap reduction is attributable to accessory feature transfer.

minor comments (2)

[Figures and Tables] Figure captions and table headers should explicitly state the number of seeds and whether error bars are standard deviation or standard error.
[Method] Notation for the fine-tuning learning rate schedule is introduced without a compact equation; a short definition would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. Below we respond point-by-point to the major comments, indicating where revisions will strengthen the work while maintaining the integrity of the reported experiments.

Referee: [Abstract / paired dataset construction] Abstract and paired-dataset construction paragraph: no quantitative validation (FID, LPIPS, frequency analysis, or perceptual study) is reported comparing inpainted glove statistics to real-glove crops. This assumption is load-bearing for the claim that the two-stage schedule improves transferable features rather than allowing exploitation of diffusion artifacts; without it, the OOD-gap reduction could be an artifact of the first-stage regularizer.

Authors: We agree that explicit quantitative validation of the inpainted accessories would strengthen the interpretation that gains arise from feature transfer rather than artifacts. The paired construction (identical backgrounds and poses, hand-region edit only) provides some control, but does not replace distributional metrics. In the revision we will add a dedicated paragraph reporting FID and LPIPS between inpainted glove regions and real-glove crops drawn from the OOD test split, plus a short note on visual inspection. This addresses the load-bearing concern directly. Revision: yes.
Referee: [Experiments A-F] Experiments section (schedules A–F): the paper does not state whether the six schedules and the two- and three-stage variants were pre-specified before seeing results or selected after observing performance. Given that the headline claim rests on the superiority of specific multi-stage regimes, post-hoc selection would undermine the statistical tests reported across seeds.

Authors: The six schedules were defined a priori, motivated by standard multi-stage transfer-learning patterns in the literature (pre-train on mixed data, then real-only fine-tuning at reduced LR, plus an additional stage for box-tightness). We will revise the Experiments section to state explicitly that the schedule set was fixed before any training runs or result inspection occurred. This clarification preserves the validity of the paired tests across seeds. Revision: yes.
Referee: [Results / OOD evaluation] Results on real-gloves OOD split: while paired tests are mentioned, the manuscript does not report effect sizes, confidence intervals on the mAP difference, or an ablation replacing inpainted data with noise- or artifact-matched controls. These are needed to establish that the observed gap reduction is attributable to accessory feature transfer.

Authors: We will add effect sizes (Cohen's d) and 95% confidence intervals for all reported mAP differences between the two- and three-stage schedules and the real-only baseline on both test sets. An artifact-matched control ablation lies outside the current scope, which centers on schedule sensitivity rather than data-quality isolation; we will acknowledge this limitation in the revised Discussion and note that future work could include such controls. The paired statistical tests already show consistent directional gains across seeds. Revision: partial.
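The effect sizes and confidence intervals promised above could be computed along the following lines. This is a sketch under assumptions, not the authors' analysis: it pairs runs by seed, uses the paired variant of Cohen's d (mean difference divided by the standard deviation of the differences, often written d_z), and builds a t-based interval from tabulated critical values. The rebuttal does not say which variant of Cohen's d will be used, and the per-seed numbers below are placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for small degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def paired_effect(baseline, treatment):
    """Mean paired difference, paired Cohen's d (d_z), and a 95% t-interval on the mean difference."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    d_z = mean_diff / sd_diff
    half_width = T_CRIT_95[n - 1] * sd_diff / math.sqrt(n)
    return mean_diff, d_z, (mean_diff - half_width, mean_diff + half_width)


# Placeholder per-seed mAP@0.5 values, NOT results from the paper.
baseline_map = [0.80, 0.82, 0.81]
two_stage_map = [0.83, 0.84, 0.85]

mean_diff, d_z, ci = paired_effect(baseline_map, two_stage_map)
print(f"mean diff={mean_diff:.3f}, d_z={d_z:.2f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

With only three seeds there are two degrees of freedom, so the critical value is 4.303 and the interval is wide even when every seed moves in the same direction. That is worth keeping in mind when reading "consistent directional gains across seeds".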

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

The paper reports results from six training-and-scheduling regimes (Experiments A-F) on YOLOv8n detectors, measuring mAP@0.5 and mAP@0.5:0.95 on real and real-gloves test splits. No equations, fitted parameters, uniqueness theorems, or derivations appear; all claims reduce to direct empirical comparisons against external test sets. The two-stage schedule benefit is presented as an observed outcome, not a constructed prediction, and the study is self-contained against standard detection benchmarks without self-referential reductions.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on the generative model producing accessory appearances whose statistics are close enough to real data that staged training can transfer; no free parameters are numerically fitted in the reported claim, but the learning-rate schedule and data-mixing ratios function as chosen hyperparameters.

free parameters (1)

fine-tuning learning rate
Described only as 'lower learning rate' in the two-stage regime; its specific value is a modeling choice that affects whether the reported mAP gain appears.

axioms (2)

domain assumption: The YOLOv8n architecture and standard mAP metrics are appropriate proxies for safety-critical hand detection performance.
Invoked by the choice of detector and evaluation protocol throughout the abstract.
domain assumption: The paired real/synthetic dataset faithfully represents the target deployment distribution shift.
Required for the claim that closing the measured gap improves real-deployment utility.
