TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts

Bastian Grande; Hugo Sax; Rui Wang; Sophokles Ktistakis

arxiv: 2605.17638 · v1 · pith:6DJ2ACGOnew · submitted 2026-05-17 · 💻 cs.CV

Pith reviewed 2026-05-20 13:35 UTC · model grok-4.3

keywords: hand-surface contact, 3D reconstruction, operating room, multi-view vision, contact detection, medical procedures, MANO hand model, semantic mapping

The pith

TouchMap-OR reconstructs which clinician touched which surface and when during operating room procedures from multi-view RGB-D data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the task of identity-resolved hand-surface interaction reconstruction in operating rooms and presents TouchMap-OR as a system that solves it. The work matters because these contacts drive pathogen transmission yet current practices rely on incomplete manual logs. TouchMap-OR builds consistent 3D tracks of multiple clinicians, fits articulated MANO hand meshes to RGB-D observations, fuses them into a semantic 3D model of the room, and detects contact episodes from temporal hand-surface proximity. On three real anesthesia induction recordings with manual annotations, the system reaches 0.75 binary contact F1 while preserving multi-person tracking performance and attaining 0.96 identity attribution accuracy.

Core claim

By reconstructing globally consistent multi-person 3D skeleton tracks, estimating and fusing articulated MANO hand meshes aligned to depth, and mapping the resulting hand trajectories onto a semantic 3D model of the operating room built from multi-view segmentation and depth fusion, TouchMap-OR infers contact episodes that record which clinician touched which surface and when, achieving 0.75 binary contact F1 on annotated data from three real procedures.

What carries the argument

Fusion of multi-view 3D hand trajectories with a semantic 3D room model followed by temporal proximity detection to infer contacts.
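The temporal proximity step can be illustrated with a minimal sketch. The summary above does not give the paper's thresholds or smoothing, so the proximity threshold `max_dist_m`, the minimum duration `min_frames`, and the function name `contact_episodes` are assumptions made purely for illustration.

```python
from typing import List, Tuple


def contact_episodes(
    distances_m: List[float],
    max_dist_m: float = 0.02,  # assumed proximity threshold, not from the paper
    min_frames: int = 3,       # assumed minimum episode length, not from the paper
) -> List[Tuple[int, int]]:
    """Return inclusive (start_frame, end_frame) pairs where the hand stays
    within max_dist_m of a surface for at least min_frames consecutive frames."""
    episodes = []
    start = None
    for frame, dist in enumerate(distances_m):
        if dist <= max_dist_m:
            if start is None:
                start = frame  # a candidate episode begins here
        else:
            # The run of close frames has ended; keep it only if long enough.
            if start is not None and frame - start >= min_frames:
                episodes.append((start, frame - 1))
            start = None
    # Close a run that lasts until the final frame.
    if start is not None and len(distances_m) - start >= min_frames:
        episodes.append((start, len(distances_m) - 1))
    return episodes


# Per-frame hand-to-surface distances in metres (made-up values).
track = [0.10, 0.015, 0.01, 0.012, 0.08, 0.01, 0.09, 0.005, 0.004, 0.003]
episodes = contact_episodes(track)
print(episodes)
```

The minimum-duration filter is one simple way to suppress single-frame near-misses; the single close frame at index 5 in the toy track is discarded for that reason.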

If this is right

Detailed contact histories become available automatically without continuous human observation.
Contacts can be attributed to specific clinicians at 0.96 accuracy.
Multi-person 3D tracking accuracy stays comparable to existing baselines.
Contact detection improves over simpler tracking-only methods.
The pipeline runs on real clinical data from anesthesia procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proximity-mapping idea could be tested in other contact-heavy environments such as laboratories or food handling areas.
Adding pressure or force data from surfaces would provide an independent check on whether detected proximities correspond to touches.
Long-term collection of such maps might reveal patterns that suggest changes to equipment layout or procedure flow.

Load-bearing premise

Hand-surface proximity measured in the reconstructed 3D trajectories accurately indicates actual physical contacts rather than near-misses.

What would settle it

Collect new recordings that include independent contact sensors on surfaces and check whether the system's proximity-based contact events match the sensor-detected events.
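A hedged sketch of how such a comparison might be scored: predicted contact episodes and sensor-detected events are treated as time intervals, matched one-to-one when they overlap, and summarised as precision, recall, and F1. The matching rule, the names `overlaps` and `event_f1`, and the toy intervals are assumptions; the paper's own definition of binary contact F1 may differ.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed time intervals share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


def event_f1(
    predicted: List[Interval], reference: List[Interval]
) -> Tuple[float, float, float]:
    """Greedy one-to-one matching of predicted to reference events.
    Returns (precision, recall, f1)."""
    unmatched = list(reference)
    true_positives = 0
    for pred in predicted:
        for ref in unmatched:
            if overlaps(pred, ref):
                unmatched.remove(ref)  # each reference event is matched once
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up events in seconds: system output versus surface-sensor events.
predicted = [(1.0, 2.0), (5.0, 6.0), (9.0, 9.5)]
sensor = [(1.5, 2.5), (5.5, 7.0), (12.0, 13.0), (15.0, 16.0)]
precision, recall, f1 = event_f1(predicted, sensor)
print(precision, recall, f1)
```

Low precision here would point to near-misses being counted as touches, which is exactly the failure mode named in the load-bearing premise.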

Figures

Figures reproduced from arXiv: 2605.17638 by Bastian Grande, Hugo Sax, Rui Wang, Sophokles Ktistakis.

**Figure 1.** From synchronized multi-view RGB-D observations (left), our system reconstructs a metric 3D representation of […]

**Figure 2.** Overview of the TouchMap-OR pipeline. Multi-view RGB-D cameras capture the operating room. […]

**Figure 3.** Example sequence from our anesthesia induction dataset. Selected frames from three synchronized camera views (rows) illustrate […]

**Figure 4.** Binary contact detection performance as a function of […]

Original abstract

Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.
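The surface-mapping step described in the abstract can be pictured as a nearest-neighbour lookup of a hand point against labelled 3D points of the room model. The following is a minimal sketch under stated assumptions: the room is reduced to a handful of labelled points, and the function name `nearest_surface`, the labels, and the cut-off `max_dist_m` are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # metric (x, y, z) in the room frame


def nearest_surface(
    hand_point: Point,
    labelled_points: List[Tuple[Point, str]],
    max_dist_m: float = 0.03,  # assumed cut-off, not from the paper
) -> Optional[Tuple[str, float]]:
    """Return (label, distance) of the closest labelled room point,
    or None when nothing lies within max_dist_m."""
    best = None
    for point, label in labelled_points:
        dist = math.dist(hand_point, point)
        if best is None or dist < best[1]:
            best = (label, dist)
    if best is None or best[1] > max_dist_m:
        return None
    return best


# Toy semantic room model: three labelled points (hypothetical labels).
room = [
    ((0.0, 0.0, 1.0), "ventilator"),
    ((1.0, 0.5, 0.9), "patient_chest"),
    ((2.0, 2.0, 0.8), "iv_pole"),
]
hit = nearest_surface((1.0, 0.5, 0.91), room)
miss = nearest_surface((5.0, 5.0, 5.0), room)
print(hit, miss)
```

A real room model would hold far more points, so a spatial index would replace the linear scan; the scan is kept here only to make the logic visible.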

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TouchMap-OR offers a practical pipeline for 3D contact mapping in operating rooms but evaluation on only three procedures weakens the performance claims.

TouchMap-OR gives a workable way to reconstruct who touched what in the operating room using cameras, but its claims rest on data from just three procedures.

The new element is the end-to-end system for identity-resolved hand-surface interaction reconstruction in ORs. It pulls together multi-view RGB-D fusion for consistent 3D tracks, MANO for articulated hands, semantic segmentation for environment mapping, and proximity to detect contacts. This targets a real need in tracking pathogen transmission that manual observation cannot handle at scale.

The paper does well by evaluating on actual recordings from three anesthesia inductions with independent manual annotations for contacts. It shows the system outperforming simple tracking baselines on contact F1 while keeping comparable tracking accuracy and high identity attribution at 0.96. The use of real data and baseline comparisons gives some grounding to the approach.

The main limitation is the small number of procedures. Results from N=3 can easily be influenced by specific OR configurations, clinician styles, or equipment placements, making it unclear if the 0.75 F1 generalizes. Without reported variance across procedures or additional validation sets, the quantitative evidence stays moderate. The assumption that hand-surface proximity in the 3D model corresponds to actual contacts is validated against annotations in this study, but depth inaccuracies or hovering hands could introduce errors not fully explored.

This paper is for computer vision people working on healthcare monitoring or applied 3D reconstruction. Anyone interested in contact logging for medical safety could extract value from the described fusion and inference steps.

I would send it to peer review. The practical focus and real data make it worth a referee's time, with the expectation that reviewers will push for stronger evaluation evidence.

Referee Report

1 major / 2 minor

Summary. The paper introduces TouchMap-OR, a multi-view RGB-D system for reconstructing identity-resolved hand-surface contacts in operating rooms. It reconstructs globally consistent multi-person 3D skeleton tracks, estimates articulated MANO hand meshes aligned to depth, builds a semantic 3D model of the OR via multi-view segmentation and depth fusion, and infers contact episodes from temporal hand-surface proximity. Evaluation on recordings from three real anesthesia inductions with manual annotations yields a binary contact F1 of 0.75 (outperforming tracking baselines), comparable multi-person tracking accuracy, and 0.96 identity attribution accuracy.

Significance. If the performance generalizes, the approach could provide automated, detailed contact histories for infection control studies in clinical environments, where manual observation is currently the norm. The combination of articulated hand modeling, multi-view fusion, and semantic surface mapping offers a concrete technical contribution to 3D reconstruction of dynamic human-object interactions.

major comments (1)

[Evaluation] Evaluation section (and abstract): the headline 0.75 binary contact F1 and baseline outperformance are computed on recordings from only three anesthesia inductions. No cross-procedure variance, leave-one-procedure-out results, or external validation set is reported, so it remains possible that the proximity heuristic aligns with annotation patterns specific to these three cases rather than reflecting a robust property of the pipeline. This directly weakens support for the central claim of reliable contact reconstruction across varied OR conditions.
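To make the requested analysis concrete, here is a minimal sketch of a leave-one-procedure-out check. It assumes the only tunable quantity is a single proximity threshold and that per-frame distances with annotated contact labels are available; the function names, the candidate thresholds, and the toy data are all hypothetical and are not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Frame = Tuple[float, bool]  # (hand-surface distance in metres, annotated contact)


def f1_at(frames: List[Frame], threshold_m: float) -> float:
    """Frame-level F1 when 'contact' is predicted for distance <= threshold_m."""
    tp = sum(1 for d, y in frames if d <= threshold_m and y)
    fp = sum(1 for d, y in frames if d <= threshold_m and not y)
    fn = sum(1 for d, y in frames if d > threshold_m and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def leave_one_procedure_out(
    data: Dict[str, List[Frame]], candidates: List[float]
) -> Dict[str, float]:
    """Tune the threshold on all other procedures, score on the held-out one."""
    scores = {}
    for held_out in data:
        train = [f for name, frames in data.items() if name != held_out for f in frames]
        best = max(candidates, key=lambda t: f1_at(train, t))
        scores[held_out] = f1_at(data[held_out], best)
    return scores


# Toy data for three procedures (made-up distances and labels).
toy = {
    "A": [(0.01, True), (0.02, True), (0.05, False), (0.10, False)],
    "B": [(0.01, True), (0.03, True), (0.04, False), (0.08, False)],
    "C": [(0.02, True), (0.06, True), (0.03, False), (0.09, False)],
}
scores = leave_one_procedure_out(toy, [0.02, 0.03, 0.06])
print(scores, mean(scores.values()), stdev(scores.values()))
```

A large spread between held-out scores would suggest that the threshold is fitting procedure-specific patterns, which is the concern raised in this comment.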

minor comments (2)

[Methods] Abstract and methods: the description of how multi-view hand reconstructions are fused and associated with tracked clinicians would benefit from an explicit statement of the association criterion (e.g., distance threshold or appearance cue) and any handling of hand occlusions.
[Methods] Figure captions and text: several references to 'movable objects' appear without clarifying whether the semantic model treats them as static or updates their poses across frames.
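One possible form of the association criterion requested in the first minor comment is a distance-thresholded match between reconstructed hand wrists and the wrist joints of tracked skeletons. The sketch below is an assumption for illustration only; the paper's actual criterion is not stated in the material above, and the names `associate_hands` and `max_dist_m` as well as the toy coordinates are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float, float]
WristKey = Tuple[str, str]  # (clinician track id, "left" or "right")


def associate_hands(
    hand_wrists: Dict[str, Point],
    skeleton_wrists: Dict[WristKey, Point],
    max_dist_m: float = 0.15,  # assumed gating distance, not from the paper
) -> Dict[str, Optional[WristKey]]:
    """Greedy assignment: closest pairs first, each hand and each skeleton
    wrist used at most once, pairs beyond max_dist_m left unassigned."""
    pairs = sorted(
        (math.dist(h, s), hand_id, wrist_key)
        for hand_id, h in hand_wrists.items()
        for wrist_key, s in skeleton_wrists.items()
    )
    assignment: Dict[str, Optional[WristKey]] = {h: None for h in hand_wrists}
    used = set()
    for dist, hand_id, wrist_key in pairs:
        if dist > max_dist_m:
            break  # all remaining pairs are even farther apart
        if assignment[hand_id] is None and wrist_key not in used:
            assignment[hand_id] = wrist_key
            used.add(wrist_key)
    return assignment


# Toy scene: two tracked clinicians and three reconstructed hands.
skeleton = {
    ("clinician_1", "left"): (0.0, 0.0, 1.0),
    ("clinician_1", "right"): (0.4, 0.0, 1.0),
    ("clinician_2", "left"): (2.0, 0.0, 1.0),
}
hands = {
    "hand_a": (0.02, 0.0, 1.0),
    "hand_b": (0.41, 0.01, 1.0),
    "hand_c": (5.0, 5.0, 5.0),  # far from every skeleton, stays unassigned
}
assignment = associate_hands(hands, skeleton)
print(assignment)
```

An occluded hand would simply be absent from `hand_wrists` for that frame, so occlusion handling would have to come from temporal smoothing outside this function.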

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential clinical relevance of TouchMap-OR. We respond to the single major comment below.

Referee: [Evaluation] Evaluation section (and abstract): the headline 0.75 binary contact F1 and baseline outperformance are computed on recordings from only three anesthesia inductions. No cross-procedure variance, leave-one-procedure-out results, or external validation set is reported, so it remains possible that the proximity heuristic aligns with annotation patterns specific to these three cases rather than reflecting a robust property of the pipeline. This directly weakens support for the central claim of reliable contact reconstruction across varied OR conditions.

Authors: We acknowledge that the evaluation relies on only three real anesthesia induction procedures and that the manuscript does not report cross-procedure variance, leave-one-procedure-out results, or external validation. This is a genuine limitation stemming from the practical difficulties of recording and annotating data inside actual operating rooms. In the revised manuscript we will add per-procedure performance breakdowns and a dedicated limitations paragraph that explicitly discusses the small sample size and the risk of procedure-specific annotation patterns. We will also clarify that the three recordings involved different clinician teams and varying OR configurations. We agree that broader validation would be needed to fully support claims of robustness across varied OR conditions and intend to pursue additional data collection in follow-up work.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's pipeline reconstructs multi-person 3D tracks and MANO hand meshes from RGB-D input, builds a semantic OR model via segmentation and depth fusion, then applies a proximity heuristic to label contact episodes. These labels are compared to independent manual annotations on three procedures to compute F1 and accuracy metrics. No equation or step defines a quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for a uniqueness claim; the central empirical result is measured against external ground-truth events rather than reducing to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are detailed; the approach relies on standard components like MANO hand models and multi-view fusion techniques from prior computer vision literature.

pith-pipeline@v0.9.0 · 5783 in / 1177 out tokens · 36926 ms · 2026-05-20T13:35:01.713342+00:00 · methodology


[1] [1]

The use of privacy-protected computer vision to mea- sure the quality of healthcare worker hand hygiene

Sari Awwad, Sanjay Tarvade, Massimo Piccardi, and David J Gattas. The use of privacy-protected computer vision to mea- sure the quality of healthcare worker hand hygiene. Inter- national Journal for Quality in Health Care , 31(1):36–42,

work page

[2] [2]

Crandall, and Chen Yu

Sven Bambach, Stefan Lee, David J. Crandall, and Chen Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In 2015 IEEE Interna- tional Conference on Computer Vision (ICCV), pages 1949– 1957, 2015. 2

work page 2015

[3] [3]

3d pictorial structures for multiple human pose estimation

Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3d pictorial structures for multiple human pose estimation. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 1669–1676, 2014. 3

work page 2014

[4] [4]

3d pictorial structures revisited: Multiple human pose estimation

Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3d pictorial structures revisited: Multiple human pose estimation. IEEE transactions on pattern analysis and machine intelligence , 38(10):1929–1942, 2015. 3

work page 1929

[5] [5]

3d hand shape and pose from images in the wild

Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10843–10852, 2019. 3

work page 2019

[6] [6]

Openpose: Realtime multi-person 2d pose estimation using part affinity fields

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence , 43(1):172–186,

work page

[7] [7]

Using computer vision and depth sensing to measure healthcare worker-patient contacts and personal protective equipment adherence within hospi- tal rooms

Junyang Chen, James F Cremer, Kasra Zarei, Alberto M Segre, and Philip M Polgreen. Using computer vision and depth sensing to measure healthcare worker-patient contacts and personal protective equipment adherence within hospi- tal rooms. In Open forum infectious diseases , page ofv200. Oxford University Press, 2016. 3

work page 2016

[8] [8]

Privacy-Preserving Action Recognition for Smart Hospitals using Low-Resolution Depth Images

Edward Chou, Matthew Tan, Cherry Zou, Michelle Guo, Albert Haque, Arnold Milstein, and Li Fei-Fei. Privacy- preserving action recognition for smart hospitals using low- resolution depth images. arXiv preprint arXiv:1811.09950,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Fast and robust multi-person 3d pose estima- tion from multiple views

Junting Dong, Wen Jiang, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Fast and robust multi-person 3d pose estima- tion from multiple views. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 7792–7801, 2019. 3, 6

work page 2019

[10] [10]

Fast and robust multi-person 3d pose estimation and tracking from multiple views

Junting Dong, Qi Fang, Wen Jiang, Yurou Yang, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Fast and robust multi-person 3d pose estimation and tracking from multiple views. IEEE transactions on pattern analysis and machine intelligence, 44(10):6981–6992, 2021. 3, 6

work page 2021

[11] [11]

Towards vision- based smart hospitals: a system for tracking and monitoring hand hygiene compliance

Albert Haque, Michelle Guo, Alexandre Alahi, Serena Ye- ung, Zelun Luo, Alisha Rege, Jeffrey Jopling, Lance Down- ing, William Beninati, Amit Singh, et al. Towards vision- based smart hospitals: a system for tracking and monitoring hand hygiene compliance. In Machine Learning for Health- care Conference, pages 75–87. PMLR, 2017. 2

work page 2017

[12] [12]

Learning joint reconstruction of hands and manipulated ob- jects

Yana Hasson, Gul Varol, Dimitrios Tzionas, Igor Kale- vatykh, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated ob- jects. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 11807–11816,

work page

[13] [13]

Hand pose estimation via latent 2.5 d heatmap regression

Umar Iqbal, Pavlo Molchanov, Thomas Breuel Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5 d heatmap regression. In Proceedings of the European conference on computer vision (ECCV), pages 118–134, 2018. 3

work page 2018

[14] [14]

Coen, Dinah J

Annette Jeanes, Pietro G. Coen, Dinah J. Gould, and Nico- las S. Drey. Validity of hand hygiene compliance measure- ment by observation: A systematic review. American Jour- nal of Infection Control, 47(3):313–322, 2019. 2

work page 2019

[15] [15]

End-to-end recovery of human shape and pose

Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7122–7131, 2018. 3

work page 2018

[16] [16]

Video-based au- tomatic hand hygiene detection for operating rooms using 3d convolutional neural networks

Minjee Kim, Joonmyeong Choi, Jun-Young Jo, Wook-Jong Kim, Sung-Hoon Kim, and Namkug Kim. Video-based au- tomatic hand hygiene detection for operating rooms using 3d convolutional neural networks. Journal of Clinical Monitor- ing and Computing, 38(5):1187–1197, 2024. 2

work page 2024

[17] [17]

Rtmo: Towards high-performance one- stage real-time multi-person pose estimation

Peng Lu, Tao Jiang, Yining Li, Xiangtai Li, Kai Chen, and Wenming Yang. Rtmo: Towards high-performance one- stage real-time multi-person pose estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1491–1500, 2024. 3

work page 2024

[18] [18]

A simple yet effective baseline for 3d human pose esti- mation

Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose esti- mation. In Proceedings of the IEEE international conference on computer vision, pages 2640–2649, 2017. 3

work page 2017

[19] [19]

Detecting hands and recognizing physical contact in the wild

Supreeth Narasimhaswamy, Trung Nguyen, and Minh Hoai Nguyen. Detecting hands and recognizing physical contact in the wild. Advances in neural information processing sys- tems, 33:7841–7851, 2020. 2

work page 2020

[20] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.

[21] Ege Özsoy, Evin Pınar Örnek, Ulrich Eck, Tobias Czempiel, Federico Tombari, and Nassir Navab. 4D-OR: Semantic scene graphs for OR domain modeling. In International conference on medical image computing and computer-assisted intervention, pages 475–485. Springer, 2022.

[22] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7025–7034, 2017.

[23] Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12242–12254, 2025.

[24] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[25] Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022.

[26] Hugo Sax, Benedetta Allegranzi, Ilker Uckay, E Larson, J Boyce, and Didier Pittet. ‘My five moments for hand hygiene’: a user-centred design approach to understand, train, monitor and report hand hygiene. Journal of Hospital Infection, 67(1):9–21, 2007.

[27] Jan B Schmutz, Bastian Grande, and Hugo Sax. WHO ‘my five moments for hand hygiene’ in anaesthesia induction: a video-based analysis reveals novel system challenges and design opportunities. Journal of Hospital Infection, 135:163–170, 2023.

[28] Dandan Shan, Jiaqi Geng, Michelle Shu, and David F Fouhey. Understanding human hands in contact at internet scale. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9869–9878,

[29] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1145–1153, 2017.

[30] Amit Singh, Albert Haque, Alexandre Alahi, Serena Yeung, Michelle Guo, Jill R Glassman, William Beninati, Terry Platchek, Li Fei-Fei, and Arnold Milstein. Automatic detection of hand hygiene using computer vision technology. Journal of the American Medical Informatics Association, 27(8):1316–1320, 2020.

[31] Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014.

[32] Andru P Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel De Mathelin, and Nicolas Padoy. EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging, 36(1):86–97, 2016.

[33] Yuxiang Zhang, Liang An, Tao Yu, Xiu Li, Kun Li, and Yebin Liu. 4D association graph for realtime multi-person motion capture using multiple video cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1324–1333, 2020.

[34] Christian Zimmermann and Thomas Brox. Learning to estimate 3D hand pose from single RGB images. In Proceedings of the IEEE international conference on computer vision, pages 4903–4911, 2017.