pith. sign in

arxiv: 2605.17638 · v1 · pith:6DJ2ACGOnew · submitted 2026-05-17 · 💻 cs.CV

TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts

Pith reviewed 2026-05-20 13:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords hand-surface contact3D reconstructionoperating roommulti-view visioncontact detectionmedical proceduresMANO hand modelsemantic mapping
0
0 comments X

The pith

TouchMap-OR reconstructs which clinician touched which surface and when during operating room procedures from multi-view RGB-D data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the task of identity-resolved hand-surface interaction reconstruction in operating rooms and presents TouchMap-OR as a system that solves it. The work matters because these contacts drive pathogen transmission yet current practices rely on incomplete manual logs. TouchMap-OR builds consistent 3D tracks of multiple clinicians, fits articulated MANO hand meshes to RGB-D observations, fuses them into a semantic 3D model of the room, and detects contact episodes from temporal hand-surface proximity. On three real anesthesia induction recordings with manual annotations, the system reaches 0.75 binary contact F1 while preserving multi-person tracking performance and attaining 0.96 identity attribution accuracy.

Core claim

By reconstructing globally consistent multi-person 3D skeleton tracks, estimating and fusing articulated MANO hand meshes aligned to depth, and mapping the resulting hand trajectories onto a semantic 3D model of the operating room built from multi-view segmentation and depth fusion, TouchMap-OR infers contact episodes that record which clinician touched which surface and when, achieving 0.75 binary contact F1 on annotated data from three real procedures.

What carries the argument

Fusion of multi-view 3D hand trajectories with a semantic 3D room model followed by temporal proximity detection to infer contacts.

If this is right

  • Detailed contact histories become available automatically without continuous human observation.
  • Contacts can be attributed to specific clinicians at 0.96 accuracy.
  • Multi-person 3D tracking accuracy stays comparable to existing baselines.
  • Contact detection improves over simpler tracking-only methods.
  • The pipeline runs on real clinical data from anesthesia procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proximity-mapping idea could be tested in other contact-heavy environments such as laboratories or food handling areas.
  • Adding pressure or force data from surfaces would provide an independent check on whether detected proximities correspond to touches.
  • Long-term collection of such maps might reveal patterns that suggest changes to equipment layout or procedure flow.

Load-bearing premise

Hand-surface proximity measured in the reconstructed 3D trajectories accurately indicates actual physical contacts rather than near-misses.

What would settle it

Collect new recordings that include independent contact sensors on surfaces and check whether the system's proximity-based contact events match the sensor-detected events.

Figures

Figures reproduced from arXiv: 2605.17638 by Bastian Grande, Hugo Sax, Rui Wang, Sophokles Ktistakis.

Figure 1
Figure 1. Figure 1: From synchronized multi-view RGB-D observations (left), our system reconstructs a metric 3D representation of [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the TouchMap-OR pipeline. Multi-view RGB-D cameras capture the operating room. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example sequence from our anesthesia induction dataset. Selected frames from three synchronized camera views (rows) illustrate [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Binary contact detection performance as a function of [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces TouchMap-OR, a multi-view RGB-D system for reconstructing identity-resolved hand-surface contacts in operating rooms. It reconstructs globally consistent multi-person 3D skeleton tracks, estimates articulated MANO hand meshes aligned to depth, builds a semantic 3D model of the OR via multi-view segmentation and depth fusion, and infers contact episodes from temporal hand-surface proximity. Evaluation on recordings from three real anesthesia inductions with manual annotations yields a binary contact F1 of 0.75 (outperforming tracking baselines), comparable multi-person tracking accuracy, and 0.96 identity attribution accuracy.

Significance. If the performance generalizes, the approach could provide automated, detailed contact histories for infection control studies in clinical environments, where manual observation is currently the norm. The combination of articulated hand modeling, multi-view fusion, and semantic surface mapping offers a concrete technical contribution to 3D reconstruction of dynamic human-object interactions.

major comments (1)
  1. [Evaluation] Evaluation section (and abstract): the headline 0.75 binary contact F1 and baseline outperformance are computed on recordings from only three anesthesia inductions. No cross-procedure variance, leave-one-procedure-out results, or external validation set is reported, so it remains possible that the proximity heuristic aligns with annotation patterns specific to these three cases rather than reflecting a robust property of the pipeline. This directly weakens support for the central claim of reliable contact reconstruction across varied OR conditions.
minor comments (2)
  1. [Methods] Abstract and methods: the description of how multi-view hand reconstructions are fused and associated with tracked clinicians would benefit from an explicit statement of the association criterion (e.g., distance threshold or appearance cue) and any handling of hand occlusions.
  2. [Methods] Figure captions and text: several references to 'movable objects' appear without clarifying whether the semantic model treats them as static or updates their poses across frames.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential clinical relevance of TouchMap-OR. We respond to the single major comment below.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (and abstract): the headline 0.75 binary contact F1 and baseline outperformance are computed on recordings from only three anesthesia inductions. No cross-procedure variance, leave-one-procedure-out results, or external validation set is reported, so it remains possible that the proximity heuristic aligns with annotation patterns specific to these three cases rather than reflecting a robust property of the pipeline. This directly weakens support for the central claim of reliable contact reconstruction across varied OR conditions.

    Authors: We acknowledge that the evaluation relies on only three real anesthesia induction procedures and that the manuscript does not report cross-procedure variance, leave-one-procedure-out results, or external validation. This is a genuine limitation stemming from the practical difficulties of recording and annotating data inside actual operating rooms. In the revised manuscript we will add per-procedure performance breakdowns and a dedicated limitations paragraph that explicitly discusses the small sample size and the risk of procedure-specific annotation patterns. We will also clarify that the three recordings involved different clinician teams and varying OR configurations. We agree that broader validation would be needed to fully support claims of robustness across varied OR conditions and intend to pursue additional data collection in follow-up work. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's pipeline reconstructs multi-person 3D tracks and MANO hand meshes from RGB-D input, builds a semantic OR model via segmentation and depth fusion, then applies a proximity heuristic to label contact episodes. These labels are compared to independent manual annotations on three procedures to compute F1 and accuracy metrics. No equation or step defines a quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for a uniqueness claim; the central empirical result is measured against external ground-truth events rather than reducing to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are detailed; the approach relies on standard components like MANO hand models and multi-view fusion techniques from prior computer vision literature.

pith-pipeline@v0.9.0 · 5783 in / 1177 out tokens · 36926 ms · 2026-05-20T13:35:01.713342+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1]

    The use of privacy-protected computer vision to mea- sure the quality of healthcare worker hand hygiene

    Sari Awwad, Sanjay Tarvade, Massimo Piccardi, and David J Gattas. The use of privacy-protected computer vision to mea- sure the quality of healthcare worker hand hygiene. Inter- national Journal for Quality in Health Care , 31(1):36–42,

  2. [2]

    Crandall, and Chen Yu

    Sven Bambach, Stefan Lee, David J. Crandall, and Chen Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In 2015 IEEE Interna- tional Conference on Computer Vision (ICCV), pages 1949– 1957, 2015. 2

  3. [3]

    3d pictorial structures for multiple human pose estimation

    Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3d pictorial structures for multiple human pose estimation. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 1669–1676, 2014. 3

  4. [4]

    3d pictorial structures revisited: Multiple human pose estimation

    Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3d pictorial structures revisited: Multiple human pose estimation. IEEE transactions on pattern analysis and machine intelligence , 38(10):1929–1942, 2015. 3

  5. [5]

    3d hand shape and pose from images in the wild

    Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10843–10852, 2019. 3

  6. [6]

    Openpose: Realtime multi-person 2d pose estimation using part affinity fields

    Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence , 43(1):172–186,

  7. [7]

    Using computer vision and depth sensing to measure healthcare worker-patient contacts and personal protective equipment adherence within hospi- tal rooms

    Junyang Chen, James F Cremer, Kasra Zarei, Alberto M Segre, and Philip M Polgreen. Using computer vision and depth sensing to measure healthcare worker-patient contacts and personal protective equipment adherence within hospi- tal rooms. In Open forum infectious diseases , page ofv200. Oxford University Press, 2016. 3

  8. [8]

    Privacy-Preserving Action Recognition for Smart Hospitals using Low-Resolution Depth Images

    Edward Chou, Matthew Tan, Cherry Zou, Michelle Guo, Albert Haque, Arnold Milstein, and Li Fei-Fei. Privacy- preserving action recognition for smart hospitals using low- resolution depth images. arXiv preprint arXiv:1811.09950,

  9. [9]

    Fast and robust multi-person 3d pose estima- tion from multiple views

    Junting Dong, Wen Jiang, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Fast and robust multi-person 3d pose estima- tion from multiple views. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 7792–7801, 2019. 3, 6

  10. [10]

    Fast and robust multi-person 3d pose estimation and tracking from multiple views

    Junting Dong, Qi Fang, Wen Jiang, Yurou Yang, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Fast and robust multi-person 3d pose estimation and tracking from multiple views. IEEE transactions on pattern analysis and machine intelligence, 44(10):6981–6992, 2021. 3, 6

  11. [11]

    Towards vision- based smart hospitals: a system for tracking and monitoring hand hygiene compliance

    Albert Haque, Michelle Guo, Alexandre Alahi, Serena Ye- ung, Zelun Luo, Alisha Rege, Jeffrey Jopling, Lance Down- ing, William Beninati, Amit Singh, et al. Towards vision- based smart hospitals: a system for tracking and monitoring hand hygiene compliance. In Machine Learning for Health- care Conference, pages 75–87. PMLR, 2017. 2

  12. [12]

    Learning joint reconstruction of hands and manipulated ob- jects

    Yana Hasson, Gul Varol, Dimitrios Tzionas, Igor Kale- vatykh, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated ob- jects. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 11807–11816,

  13. [13]

    Hand pose estimation via latent 2.5 d heatmap regression

    Umar Iqbal, Pavlo Molchanov, Thomas Breuel Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5 d heatmap regression. In Proceedings of the European conference on computer vision (ECCV), pages 118–134, 2018. 3

  14. [14]

    Coen, Dinah J

    Annette Jeanes, Pietro G. Coen, Dinah J. Gould, and Nico- las S. Drey. Validity of hand hygiene compliance measure- ment by observation: A systematic review. American Jour- nal of Infection Control, 47(3):313–322, 2019. 2

  15. [15]

    End-to-end recovery of human shape and pose

    Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7122–7131, 2018. 3

  16. [16]

    Video-based au- tomatic hand hygiene detection for operating rooms using 3d convolutional neural networks

    Minjee Kim, Joonmyeong Choi, Jun-Young Jo, Wook-Jong Kim, Sung-Hoon Kim, and Namkug Kim. Video-based au- tomatic hand hygiene detection for operating rooms using 3d convolutional neural networks. Journal of Clinical Monitor- ing and Computing, 38(5):1187–1197, 2024. 2

  17. [17]

    Rtmo: Towards high-performance one- stage real-time multi-person pose estimation

    Peng Lu, Tao Jiang, Yining Li, Xiangtai Li, Kai Chen, and Wenming Yang. Rtmo: Towards high-performance one- stage real-time multi-person pose estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1491–1500, 2024. 3

  18. [18]

    A simple yet effective baseline for 3d human pose esti- mation

    Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose esti- mation. In Proceedings of the IEEE international conference on computer vision, pages 2640–2649, 2017. 3

  19. [19]

    Detecting hands and recognizing physical contact in the wild

    Supreeth Narasimhaswamy, Trung Nguyen, and Minh Hoai Nguyen. Detecting hands and recognizing physical contact in the wild. Advances in neural information processing sys- tems, 33:7841–7851, 2020. 2

  20. [20]

    Stacked hour- glass networks for human pose estimation

    Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hour- glass networks for human pose estimation. InEuropean con- ference on computer vision, pages 483–499. Springer, 2016. 3

  21. [21]

    4d-or: Semantic scene graphs for or domain modeling

    Ege ¨Ozsoy, Evin Pınar ¨Ornek, Ulrich Eck, Tobias Czempiel, Federico Tombari, and Nassir Navab. 4d-or: Semantic scene graphs for or domain modeling. In International conference on medical image computing and computer-assisted inter- vention, pages 475–485. Springer, 2022. 2

  22. [22]

    Coarse-to-fine volumetric pre- diction for single-image 3d human pose

    Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpa- nis, and Kostas Daniilidis. Coarse-to-fine volumetric pre- diction for single-image 3d human pose. In Proceedings of the IEEE conference on computer vision and pattern recog- nition, pages 7025–7034, 2017. 3

  23. [23]

    Wilor: End-to-end 3d hand localization and reconstruction in-the-wild

    Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12242–12254, 2025. 3

  24. [24]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 3

  25. [25]

    Em- bodied hands: Modeling and capturing hands and bodies to- gether

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Em- bodied hands: Modeling and capturing hands and bodies to- gether. arXiv preprint arXiv:2201.02610, 2022. 3

  26. [26]

    ‘my five moments for hand hy- giene’: a user-centred design approach to understand, train, monitor and report hand hygiene

    Hugo Sax, Benedetta Allegranzi, Ilker Uckay, E Larson, J Boyce, and Didier Pittet. ‘my five moments for hand hy- giene’: a user-centred design approach to understand, train, monitor and report hand hygiene. Journal of Hospital infec- tion, 67(1):9–21, 2007. 2

  27. [27]

    Who ‘my five moments for hand hygiene’in anaesthesia induction: a video-based analysis reveals novel system challenges and de- sign opportunities

    Jan B Schmutz, Bastian Grande, and Hugo Sax. Who ‘my five moments for hand hygiene’in anaesthesia induction: a video-based analysis reveals novel system challenges and de- sign opportunities. Journal of Hospital Infection , 135:163– 170, 2023. 2

  28. [28]

    Understanding human hands in contact at inter- net scale

    Dandan Shan, Jiaqi Geng, Michelle Shu, and David F Fouhey. Understanding human hands in contact at inter- net scale. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9869–9878,

  29. [29]

    Hand keypoint detection in single images using mul- tiview bootstrapping

    Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using mul- tiview bootstrapping. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition , pages 1145– 1153, 2017. 3

  30. [30]

    Automatic de- tection of hand hygiene using computer vision technology

    Amit Singh, Albert Haque, Alexandre Alahi, Serena Ye- ung, Michelle Guo, Jill R Glassman, William Beninati, Terry Platchek, Li Fei-Fei, and Arnold Milstein. Automatic de- tection of hand hygiene using computer vision technology. Journal of the American Medical Informatics Association , 27(8):1316–1320, 2020. 2

  31. [31]

    Deeppose: Human pose estimation via deep neural networks

    Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recog- nition, pages 1653–1660, 2014. 3

  32. [32]

    En- donet: a deep architecture for recognition tasks on laparo- scopic videos

    Andru P Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel De Mathelin, and Nicolas Padoy. En- donet: a deep architecture for recognition tasks on laparo- scopic videos. IEEE transactions on medical imaging , 36 (1):86–97, 2016. 2

  33. [33]

    4d association graph for realtime multi-person motion capture using multiple video cameras

    Yuxiang Zhang, Liang An, Tao Yu, Xiu Li, Kun Li, and Yebin Liu. 4d association graph for realtime multi-person motion capture using multiple video cameras. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1324–1333, 2020. 3, 6

  34. [34]

    Learning to esti- mate 3d hand pose from single rgb images

    Christian Zimmermann and Thomas Brox. Learning to esti- mate 3d hand pose from single rgb images. InProceedings of the IEEE international conference on computer vision, pages 4903–4911, 2017. 3