pith. sign in

arxiv: 2605.23653 · v1 · pith:O3IDYDHCnew · submitted 2026-05-22 · 💻 cs.CV

ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction

Pith reviewed 2026-05-25 04:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords open surgeryskill assessment3D hand reconstructionexplainable AItemporal convolutional networksattention mechanismskinematic analysissurgical training
0
0 comments X

The pith

ExpOS learns temporal motion patterns from 3D hand reconstructions to predict open-surgery skill levels without expert-defined metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Surgical training feedback currently depends on limited expert observation that does not scale. The paper shows that kinematic data extracted from 3D hand poses in videos can be processed automatically to rate performance on open-surgery tasks. A temporal convolutional network with attention pooling identifies the motion segments most predictive of skill, then fuses those with global statistics for the final prediction. The resulting scores correlate with expert ratings at up to 0.778 on fascial closure while also producing attention maps that indicate which behaviors drove each assessment. This removes the need for manually chosen metrics and supplies multi-level explanations of the output.

Core claim

ExpOS extracts hand poses and tool detections from 221 videos of three open-surgery tasks, derives kinematic descriptors and global motion statistics, models spatiotemporal hand-tool dynamics with a temporal convolutional backbone plus attention-based pooling, and fuses the resulting representations to predict skill level while generating frame-level importance maps.

What carries the argument

Attention-based pooling inside a temporal convolutional network applied to kinematic descriptors from 3D hand reconstruction, which produces frame-level importance maps and enables fusion with global motion statistics.

Load-bearing premise

The extracted 3D hand poses and tool detections together with the attention mechanism capture the motion characteristics that actually determine expert-rated skill.

What would settle it

A new collection of open-surgery videos rated independently by experts where the model's predicted skill scores show little or no correlation with those ratings would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.23653 by Idan Smoller, Roi Papo, Shlomi Laufer.

Figure 1
Figure 1. Figure 1: Overview of ExpOS explainability outputs: (A) 3D hand pose, (B) temporal attention highlights key video frames, and (C) SHAP-based global feature importance reveals which motion statistics most influenced the skill score. We propose ExpOS, an explainable framework for data-driven surgical skill assessment that enables automatic, feedback-oriented evaluation. 3D hand tra￾jectories and kinematic descriptors … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the procedures executed by the surgical residents A. Suturing, B. Knot Tying, C. Facial closure. A YOLO [12] model was trained to detect surgical tools (forceps, needle driver, scissors, simulator) using 731 annotated frames with augmentation to ensure robustness. For hand detection, we employed RoHans [17] pseudo-label self-training to refine a YOLO-based detector, which was then integrated in… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed framework pipeline. The input video is processed into per-frame features consisting of 3D hand keypoints and bounding box information for hands and tools. Additionally, statistical global features are also calculated. The per-frame features are fed into the MS-TCN++ module, which generates a sequence of hidden representations over time. Temporal Attention Pooling is applied to extr… view at source ↗
Figure 4
Figure 4. Figure 4: SHAP beeswarm plot for the Fascial Closure test set. Each point of a feature corresponds to a video, with SHAP values (x-axis) indicating the features impact on the predicted skill score and colors denoting feature values (blue = low, red = high). Task completion time, still forceps usage, and coordination between hands emerged as the most influential descriptors. 6 Discussion and Conclusion Our results de… view at source ↗
read the original abstract

Timely and transparent feedback is essential for effective surgical training, yet current assessment remains dependent on expert observation, limiting scalability and opportunities for autonomous practice. We present ExpOS, an explainable framework for data-driven assessment of open-surgery skills designed to enable automatic, feedback-oriented evaluation. Rather than relying on expert-defined metrics, ExpOS learns discriminative temporal patterns directly from motion data and identifies the segments and behaviors most predictive of skill level. We trained and evaluated the method on 221 videos of medical students performing three open-surgery tasks. Hand poses and tool detections were extracted from each frame to derive kinematic descriptors and global motion statistics. Spatiotemporal hand-tool dynamics were modeled using a temporal convolutional backbone with attention-based pooling to generate frame-level importance maps. These representations were fused with global motion statistics to predict skill level and to provide interpretable feedback. ExpOS provides multi-level explainability by identifying when informative events occur through attention weights and which motion characteristics most influence predictions through global feature analysis. Across tasks, the framework achieved strong correlation with expert ratings, with best performance on fascial closure (r = 0.778, R2 = 0.74). These results demonstrate that combining weakly-supervised temporal importance learning with interpretable motion statistics enables scalable and actionable surgical skill assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ExpOS, a framework for automatic, explainable assessment of open-surgery skills from video. It extracts 3D hand poses and tool detections to derive kinematic descriptors and global motion statistics, models spatiotemporal dynamics with a temporal convolutional backbone plus attention-based pooling to produce frame-level importance maps, fuses these with global statistics to predict expert-rated skill levels, and supplies multi-level interpretability via attention weights and feature analysis. Evaluated on 221 videos of medical students performing three tasks, it reports correlations with expert ratings (best: r=0.778, R²=0.74 on fascial closure).

Significance. If the reported correlations prove robust under proper validation, the approach could meaningfully advance scalable, feedback-oriented surgical training by reducing reliance on constant expert observation while adding interpretability. The data-driven learning of temporal patterns from 3D kinematics, combined with explicit attention maps, is a constructive direction; however, the absence of baseline comparisons and validation details in the provided description limits immediate assessment of impact.

major comments (3)
  1. [Results] Results section: the reported correlations (r=0.778, R²=0.74 on fascial closure; similar figures across tasks) are presented without baseline methods, confidence intervals, or error bars, and without stating the train/test split strategy or subject-wise cross-validation. These omissions are load-bearing for the central claim of 'strong correlation' and generalizability.
  2. [Methods] Methods/Evaluation: no information is given on the number of expert raters, inter-rater reliability (e.g., ICC or Cohen's kappa), or how label variability was handled when training the supervised model. This directly affects the reliability of the target labels used to train the temporal attention model.
  3. [Methods] Methods: the claim that the framework 'rather than relying on expert-defined metrics, learns discriminative temporal patterns directly from motion data' requires clarification on whether any post-hoc feature selection or expert-derived thresholds were applied to the kinematic descriptors before fusion; if present, this would qualify the 'parameter-free' or purely data-driven assertion.
minor comments (2)
  1. [Abstract] Abstract and text: the phrase 'weakly-supervised temporal importance learning' is used but the model is trained end-to-end on expert skill labels; a brief clarification of the supervision level would improve precision.
  2. [Figures] Figure captions and text: ensure that attention-map visualizations are accompanied by quantitative metrics (e.g., overlap with annotated critical segments) rather than qualitative examples alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of validation and methodological clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Results] Results section: the reported correlations (r=0.778, R²=0.74 on fascial closure; similar figures across tasks) are presented without baseline methods, confidence intervals, or error bars, and without stating the train/test split strategy or subject-wise cross-validation. These omissions are load-bearing for the central claim of 'strong correlation' and generalizability.

    Authors: We agree these reporting elements are necessary to substantiate the claims. In the revised manuscript we will add baseline comparisons (including a non-attention temporal model and regression on global statistics alone), report 95% confidence intervals computed via bootstrapping, include error bars on figures, and explicitly describe the subject-wise cross-validation protocol used to ensure no leakage across individuals. revision: yes

  2. Referee: [Methods] Methods/Evaluation: no information is given on the number of expert raters, inter-rater reliability (e.g., ICC or Cohen's kappa), or how label variability was handled when training the supervised model. This directly affects the reliability of the target labels used to train the temporal attention model.

    Authors: The current manuscript omits these details. We will revise the Methods section to include the number of expert raters, describe how ratings were aggregated to produce the target labels, and either report inter-rater reliability if available from the dataset or explicitly note its absence as a limitation while clarifying that the provided expert ratings constitute the standard supervision signal. revision: yes

  3. Referee: [Methods] Methods: the claim that the framework 'rather than relying on expert-defined metrics, learns discriminative temporal patterns directly from motion data' requires clarification on whether any post-hoc feature selection or expert-derived thresholds were applied to the kinematic descriptors before fusion; if present, this would qualify the 'parameter-free' or purely data-driven assertion.

    Authors: No post-hoc feature selection or expert-derived thresholds were applied at any stage. Kinematic descriptors are computed directly from the 3D hand poses and tool detections; the temporal convolutional backbone with attention learns discriminative patterns end-to-end, and fusion with global statistics occurs without manual rules. We will add an explicit clarifying sentence in the Methods section to remove any ambiguity regarding the data-driven nature of the pipeline. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes a supervised learning pipeline that extracts kinematic features from 3D hand and tool detections, then trains a temporal convolutional model with attention to predict externally provided expert skill ratings on held-out videos. The reported correlations (r = 0.778, R2 = 0.74) are direct statistical comparisons against those independent labels rather than any internal reconstruction, self-fit, or renamed input. No equations, self-citations, or uniqueness claims appear in the provided text that would reduce the performance numbers to a definitional tautology or fitted-input prediction. The derivation chain therefore remains self-contained against the external expert benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; all fields left empty.

pith-pipeline@v0.9.0 · 5764 in / 1025 out tokens · 40636 ms · 2026-05-25T04:47:57.602898+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    IEEE Robotics and Automation Letters 8(2), 1755–1762 (2023)

    Anastasiou, D., Jin, Y., Stoyanov, D., Mazomenos, E.: Keep your eye on the best: Contrastive regression transformer for skill assessment in robotic surgery. IEEE Robotics and Automation Letters 8(2), 1755–1762 (2023)

  2. [2]

    International Journal of Computer Assisted Radiology and Surgery 18(7), 1279–1285 (2023)

    Bkheet, E., DAngelo, A.L., Goldbraikh, A., Laufer, S.: Using hand pose estimation to automate open surgery training feedback. International Journal of Computer Assisted Radiology and Surgery 18(7), 1279–1285 (2023)

  3. [3]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visu- alization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 782–791 (2021)

  4. [4]

    The American journal of surgery 209(4), 645–651 (2015)

    D’Angelo, A.L.D., Rutherford, D.N., Ray, R.D., Laufer, S., Kwan, C., Cohen, E.R., Mason, A., Pugh, C.M.: Idle time: an underdeveloped performance metric for as- sessing surgical skill. The American journal of surgery 209(4), 645–651 (2015)

  5. [5]

    In: Proceedings of NAACL-HLT

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec- tional transformers for language understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186. Association for Computational Linguistics (2019)

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

    Diaz, R., Marathe, A.: Soft labels for ordinal regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

  7. [7]

    International Journal of Computer Assisted Radiology and Surgery 14, 1217–1225 (2019)

    Funke, I., Mees, S., Weitz, J., Speidel, S.: Video-based surgical skill assessment us- ing 3d convolutional neural networks. International Journal of Computer Assisted Radiology and Surgery 14, 1217–1225 (2019)

  8. [8]

    In: MICCAI workshop: M2cai

    Gao, Y., Vedula, S.S., Reiley, C.E., Ahmidi, N., Varadarajan, B., Lin, H.C., Tao, L., Zappella, L., Béjar, B., Yuh, D.D., et al.: Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In: MICCAI workshop: M2cai. vol. 3, p. 3 (2014)

  9. [9]

    International Journal of Computer As- sisted Radiology and Surgery 17, 437–448 (2022)

    Goldbraikh, A., D’Angelo, A., Pugh, C., Laufer, S.: Video-based fully automatic assessment of open surgery suturing skills. International Journal of Computer As- sisted Radiology and Surgery 17, 437–448 (2022)

  10. [10]

    International Journal of Computer Assisted Radiology and Surgery (2024)

    Hoffmann, H., Funke, I., Peters, P., Venkatesh, D., Egger, J., Rivoir, D., Röhrig, R., Hölzle, F., Bodenstedt, S., Willemer, M., Speidel, S., Puladi, B.: Aixsuture: vision-based assessment of open suturing skills. International Journal of Computer Assisted Radiology and Surgery (2024)

  11. [11]

    International journal of computer assisted radiology and surgery 14(9), 1611–1617 (2019)

    Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Accurate and interpretable evaluation of surgical skills from kinematic data using fully con- volutional neural networks. International journal of computer assisted radiology and surgery 14(9), 1611–1617 (2019)

  12. [12]

    Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics YOLO (Jan 2023), https://github.com/ultralytics/ultralytics

  13. [13]

    NPJ digital medicine 5(1), 24 (2022)

    Lam, K., Chen, J., Wang, Z., Iqbal, F.M., Darzi, A., Lo, B., Purkayastha, S., Kin- ross, J.M.: Machine learning for technical skill assessment in surgery: a systematic review. NPJ digital medicine 5(1), 24 (2022)

  14. [14]

    IEEE Trans- actions on Pattern Analysis and Machine Intelligence pp

    Li, S.J., AbuFarha, Y., Liu, Y., Cheng, M.M., Gall, J.: Ms-tcn++: Multi- stage temporal convolutional network for action segmentation. IEEE Trans- actions on Pattern Analysis and Machine Intelligence pp. 1–1 (2020). https://doi.org/10.1109/TPAMI.2020.3021756

  15. [15]

    In: Proceedings of the 31st International Conference on Neural Information Processing Systems

    Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 4768–4777. Curran Associates Inc. (2017) 10 Roi Papo et al

  16. [16]

    British journal of surgery 84(2), 273–278 (1997)

    Martin, J., Regehr, G., Reznick, R., Macrae, H., Murnaghan, J., Hutchison, C., Brown, M.: Objective structured assessment of technical skill (osats) for surgical residents. British journal of surgery 84(2), 273–278 (1997)

  17. [17]

    arXiv preprint arXiv:2501.08115 (2025)

    Papo, R., Gershov, S., Friedman, T., et al.: Rohan: Robust hand detection in operation room. arXiv preprint arXiv:2501.08115 (2025)

  18. [18]

    Electronics 13(8), 1430 (2024)

    Perikos, I., Tzafilkou, K., Grivokostopoulou, F.: Bert-based transformers for sentiment analysis: A comparative study of explainability methods. Electronics 13(8), 1430 (2024). https://doi.org/10.3390/electronics13081430, https://doi.org/10.3390/electronics13081430

  19. [19]

    arXiv preprint arXiv:2409.12259 (2024)

    Potamias, R.A., Zhang, J., Deng, J., Zafeiriou, S.: Wilor: End-to-end 3d hand localization and reconstruction in-the-wild. arXiv preprint arXiv:2409.12259 (2024)

  20. [20]

    Analytical Chemistry 36(8), 1627–1639 (1964)

    Savitzky, A., Golay, M.J.E.: Smoothing and differentiation of data by simpli- fied least squares procedures. Analytical Chemistry 36(8), 1627–1639 (1964). https://doi.org/10.1021/ac60214a047

  21. [21]

    In: Interna- tional conference on information processing in computer-assisted interventions

    Tao, L., Elhamifar, E., Khudanpur, S., Hager, G.D., Vidal, R.: Sparse hidden markov models for surgical gesture classification and skill evaluation. In: Interna- tional conference on information processing in computer-assisted interventions. pp. 167–177. Springer (2012)

  22. [22]

    IEEE Transactions on Neural Networks and Learning Systems 32(11), 4793–4813 (2020)

    Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (xai): Toward medical xai. IEEE Transactions on Neural Networks and Learning Systems 32(11), 4793–4813 (2020)

  23. [23]

    Annual review of biomedical engineering 19, 301–325 (2017)

    Vedula, S.S., Ishii, M., Hager, G.D.: Objective assessment of surgical technical skill and competency in the operating room. Annual review of biomedical engineering 19, 301–325 (2017)

  24. [24]

    The Journal of Defense Modeling and Simulation 19(2), 159– 171 (2022)

    Yanik, E., Intes, X., Kruger, U., Yan, P., Diller, D., Van Voorst, B., Makled, B., Norfleet, J., De, S.: Deep neural networks for the assessment of surgical skills: A systematic review. The Journal of Defense Modeling and Simulation 19(2), 159– 171 (2022)