ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction
Pith reviewed 2026-05-25 04:47 UTC · model grok-4.3
The pith
ExpOS learns temporal motion patterns from 3D hand reconstructions to predict open-surgery skill levels without expert-defined metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ExpOS extracts hand poses and tool detections from 221 videos of three open-surgery tasks, derives kinematic descriptors and global motion statistics, models spatiotemporal hand-tool dynamics with a temporal convolutional backbone plus attention-based pooling, and fuses the resulting representations to predict skill level while generating frame-level importance maps.
What carries the argument
Attention-based pooling inside a temporal convolutional network applied to kinematic descriptors from 3D hand reconstruction, which produces frame-level importance maps and enables fusion with global motion statistics.
Load-bearing premise
The extracted 3D hand poses and tool detections together with the attention mechanism capture the motion characteristics that actually determine expert-rated skill.
What would settle it
A new collection of open-surgery videos rated independently by experts where the model's predicted skill scores show little or no correlation with those ratings would falsify the central claim.
Figures
read the original abstract
Timely and transparent feedback is essential for effective surgical training, yet current assessment remains dependent on expert observation, limiting scalability and opportunities for autonomous practice. We present ExpOS, an explainable framework for data-driven assessment of open-surgery skills designed to enable automatic, feedback-oriented evaluation. Rather than relying on expert-defined metrics, ExpOS learns discriminative temporal patterns directly from motion data and identifies the segments and behaviors most predictive of skill level. We trained and evaluated the method on 221 videos of medical students performing three open-surgery tasks. Hand poses and tool detections were extracted from each frame to derive kinematic descriptors and global motion statistics. Spatiotemporal hand-tool dynamics were modeled using a temporal convolutional backbone with attention-based pooling to generate frame-level importance maps. These representations were fused with global motion statistics to predict skill level and to provide interpretable feedback. ExpOS provides multi-level explainability by identifying when informative events occur through attention weights and which motion characteristics most influence predictions through global feature analysis. Across tasks, the framework achieved strong correlation with expert ratings, with best performance on fascial closure (r = 0.778, R2 = 0.74). These results demonstrate that combining weakly-supervised temporal importance learning with interpretable motion statistics enables scalable and actionable surgical skill assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ExpOS, a framework for automatic, explainable assessment of open-surgery skills from video. It extracts 3D hand poses and tool detections to derive kinematic descriptors and global motion statistics, models spatiotemporal dynamics with a temporal convolutional backbone plus attention-based pooling to produce frame-level importance maps, fuses these with global statistics to predict expert-rated skill levels, and supplies multi-level interpretability via attention weights and feature analysis. Evaluated on 221 videos of medical students performing three tasks, it reports correlations with expert ratings (best: r=0.778, R²=0.74 on fascial closure).
Significance. If the reported correlations prove robust under proper validation, the approach could meaningfully advance scalable, feedback-oriented surgical training by reducing reliance on constant expert observation while adding interpretability. The data-driven learning of temporal patterns from 3D kinematics, combined with explicit attention maps, is a constructive direction; however, the absence of baseline comparisons and validation details in the provided description limits immediate assessment of impact.
major comments (3)
- [Results] Results section: the reported correlations (r=0.778, R²=0.74 on fascial closure; similar figures across tasks) are presented without baseline methods, confidence intervals, or error bars, and without stating the train/test split strategy or subject-wise cross-validation. These omissions are load-bearing for the central claim of 'strong correlation' and generalizability.
- [Methods] Methods/Evaluation: no information is given on the number of expert raters, inter-rater reliability (e.g., ICC or Cohen's kappa), or how label variability was handled when training the supervised model. This directly affects the reliability of the target labels used to train the temporal attention model.
- [Methods] Methods: the claim that the framework 'rather than relying on expert-defined metrics, learns discriminative temporal patterns directly from motion data' requires clarification on whether any post-hoc feature selection or expert-derived thresholds were applied to the kinematic descriptors before fusion; if present, this would qualify the 'parameter-free' or purely data-driven assertion.
minor comments (2)
- [Abstract] Abstract and text: the phrase 'weakly-supervised temporal importance learning' is used but the model is trained end-to-end on expert skill labels; a brief clarification of the supervision level would improve precision.
- [Figures] Figure captions and text: ensure that attention-map visualizations are accompanied by quantitative metrics (e.g., overlap with annotated critical segments) rather than qualitative examples alone.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of validation and methodological clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [Results] Results section: the reported correlations (r=0.778, R²=0.74 on fascial closure; similar figures across tasks) are presented without baseline methods, confidence intervals, or error bars, and without stating the train/test split strategy or subject-wise cross-validation. These omissions are load-bearing for the central claim of 'strong correlation' and generalizability.
Authors: We agree these reporting elements are necessary to substantiate the claims. In the revised manuscript we will add baseline comparisons (including a non-attention temporal model and regression on global statistics alone), report 95% confidence intervals computed via bootstrapping, include error bars on figures, and explicitly describe the subject-wise cross-validation protocol used to ensure no leakage across individuals. revision: yes
-
Referee: [Methods] Methods/Evaluation: no information is given on the number of expert raters, inter-rater reliability (e.g., ICC or Cohen's kappa), or how label variability was handled when training the supervised model. This directly affects the reliability of the target labels used to train the temporal attention model.
Authors: The current manuscript omits these details. We will revise the Methods section to include the number of expert raters, describe how ratings were aggregated to produce the target labels, and either report inter-rater reliability if available from the dataset or explicitly note its absence as a limitation while clarifying that the provided expert ratings constitute the standard supervision signal. revision: yes
-
Referee: [Methods] Methods: the claim that the framework 'rather than relying on expert-defined metrics, learns discriminative temporal patterns directly from motion data' requires clarification on whether any post-hoc feature selection or expert-derived thresholds were applied to the kinematic descriptors before fusion; if present, this would qualify the 'parameter-free' or purely data-driven assertion.
Authors: No post-hoc feature selection or expert-derived thresholds were applied at any stage. Kinematic descriptors are computed directly from the 3D hand poses and tool detections; the temporal convolutional backbone with attention learns discriminative patterns end-to-end, and fusion with global statistics occurs without manual rules. We will add an explicit clarifying sentence in the Methods section to remove any ambiguity regarding the data-driven nature of the pipeline. revision: yes
Circularity Check
No significant circularity
full rationale
The paper describes a supervised learning pipeline that extracts kinematic features from 3D hand and tool detections, then trains a temporal convolutional model with attention to predict externally provided expert skill ratings on held-out videos. The reported correlations (r = 0.778, R2 = 0.74) are direct statistical comparisons against those independent labels rather than any internal reconstruction, self-fit, or renamed input. No equations, self-citations, or uniqueness claims appear in the provided text that would reduce the performance numbers to a definitional tautology or fitted-input prediction. The derivation chain therefore remains self-contained against the external expert benchmark.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
IEEE Robotics and Automation Letters 8(2), 1755–1762 (2023)
Anastasiou, D., Jin, Y., Stoyanov, D., Mazomenos, E.: Keep your eye on the best: Contrastive regression transformer for skill assessment in robotic surgery. IEEE Robotics and Automation Letters 8(2), 1755–1762 (2023)
work page 2023
-
[2]
International Journal of Computer Assisted Radiology and Surgery 18(7), 1279–1285 (2023)
Bkheet, E., DAngelo, A.L., Goldbraikh, A., Laufer, S.: Using hand pose estimation to automate open surgery training feedback. International Journal of Computer Assisted Radiology and Surgery 18(7), 1279–1285 (2023)
work page 2023
-
[3]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visu- alization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 782–791 (2021)
work page 2021
-
[4]
The American journal of surgery 209(4), 645–651 (2015)
D’Angelo, A.L.D., Rutherford, D.N., Ray, R.D., Laufer, S., Kwan, C., Cohen, E.R., Mason, A., Pugh, C.M.: Idle time: an underdeveloped performance metric for as- sessing surgical skill. The American journal of surgery 209(4), 645–651 (2015)
work page 2015
-
[5]
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec- tional transformers for language understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186. Association for Computational Linguistics (2019)
work page 2019
-
[6]
Diaz, R., Marathe, A.: Soft labels for ordinal regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
work page 2019
-
[7]
International Journal of Computer Assisted Radiology and Surgery 14, 1217–1225 (2019)
Funke, I., Mees, S., Weitz, J., Speidel, S.: Video-based surgical skill assessment us- ing 3d convolutional neural networks. International Journal of Computer Assisted Radiology and Surgery 14, 1217–1225 (2019)
work page 2019
-
[8]
Gao, Y., Vedula, S.S., Reiley, C.E., Ahmidi, N., Varadarajan, B., Lin, H.C., Tao, L., Zappella, L., Béjar, B., Yuh, D.D., et al.: Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In: MICCAI workshop: M2cai. vol. 3, p. 3 (2014)
work page 2014
-
[9]
International Journal of Computer As- sisted Radiology and Surgery 17, 437–448 (2022)
Goldbraikh, A., D’Angelo, A., Pugh, C., Laufer, S.: Video-based fully automatic assessment of open surgery suturing skills. International Journal of Computer As- sisted Radiology and Surgery 17, 437–448 (2022)
work page 2022
-
[10]
International Journal of Computer Assisted Radiology and Surgery (2024)
Hoffmann, H., Funke, I., Peters, P., Venkatesh, D., Egger, J., Rivoir, D., Röhrig, R., Hölzle, F., Bodenstedt, S., Willemer, M., Speidel, S., Puladi, B.: Aixsuture: vision-based assessment of open suturing skills. International Journal of Computer Assisted Radiology and Surgery (2024)
work page 2024
-
[11]
International journal of computer assisted radiology and surgery 14(9), 1611–1617 (2019)
Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Accurate and interpretable evaluation of surgical skills from kinematic data using fully con- volutional neural networks. International journal of computer assisted radiology and surgery 14(9), 1611–1617 (2019)
work page 2019
-
[12]
Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics YOLO (Jan 2023), https://github.com/ultralytics/ultralytics
work page 2023
-
[13]
NPJ digital medicine 5(1), 24 (2022)
Lam, K., Chen, J., Wang, Z., Iqbal, F.M., Darzi, A., Lo, B., Purkayastha, S., Kin- ross, J.M.: Machine learning for technical skill assessment in surgery: a systematic review. NPJ digital medicine 5(1), 24 (2022)
work page 2022
-
[14]
IEEE Trans- actions on Pattern Analysis and Machine Intelligence pp
Li, S.J., AbuFarha, Y., Liu, Y., Cheng, M.M., Gall, J.: Ms-tcn++: Multi- stage temporal convolutional network for action segmentation. IEEE Trans- actions on Pattern Analysis and Machine Intelligence pp. 1–1 (2020). https://doi.org/10.1109/TPAMI.2020.3021756
-
[15]
In: Proceedings of the 31st International Conference on Neural Information Processing Systems
Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 4768–4777. Curran Associates Inc. (2017) 10 Roi Papo et al
work page 2017
-
[16]
British journal of surgery 84(2), 273–278 (1997)
Martin, J., Regehr, G., Reznick, R., Macrae, H., Murnaghan, J., Hutchison, C., Brown, M.: Objective structured assessment of technical skill (osats) for surgical residents. British journal of surgery 84(2), 273–278 (1997)
work page 1997
-
[17]
arXiv preprint arXiv:2501.08115 (2025)
Papo, R., Gershov, S., Friedman, T., et al.: Rohan: Robust hand detection in operation room. arXiv preprint arXiv:2501.08115 (2025)
-
[18]
Electronics 13(8), 1430 (2024)
Perikos, I., Tzafilkou, K., Grivokostopoulou, F.: Bert-based transformers for sentiment analysis: A comparative study of explainability methods. Electronics 13(8), 1430 (2024). https://doi.org/10.3390/electronics13081430, https://doi.org/10.3390/electronics13081430
-
[19]
arXiv preprint arXiv:2409.12259 (2024)
Potamias, R.A., Zhang, J., Deng, J., Zafeiriou, S.: Wilor: End-to-end 3d hand localization and reconstruction in-the-wild. arXiv preprint arXiv:2409.12259 (2024)
-
[20]
Analytical Chemistry 36(8), 1627–1639 (1964)
Savitzky, A., Golay, M.J.E.: Smoothing and differentiation of data by simpli- fied least squares procedures. Analytical Chemistry 36(8), 1627–1639 (1964). https://doi.org/10.1021/ac60214a047
-
[21]
In: Interna- tional conference on information processing in computer-assisted interventions
Tao, L., Elhamifar, E., Khudanpur, S., Hager, G.D., Vidal, R.: Sparse hidden markov models for surgical gesture classification and skill evaluation. In: Interna- tional conference on information processing in computer-assisted interventions. pp. 167–177. Springer (2012)
work page 2012
-
[22]
IEEE Transactions on Neural Networks and Learning Systems 32(11), 4793–4813 (2020)
Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (xai): Toward medical xai. IEEE Transactions on Neural Networks and Learning Systems 32(11), 4793–4813 (2020)
work page 2020
-
[23]
Annual review of biomedical engineering 19, 301–325 (2017)
Vedula, S.S., Ishii, M., Hager, G.D.: Objective assessment of surgical technical skill and competency in the operating room. Annual review of biomedical engineering 19, 301–325 (2017)
work page 2017
-
[24]
The Journal of Defense Modeling and Simulation 19(2), 159– 171 (2022)
Yanik, E., Intes, X., Kruger, U., Yan, P., Diller, D., Van Voorst, B., Makled, B., Norfleet, J., De, S.: Deep neural networks for the assessment of surgical skills: A systematic review. The Journal of Defense Modeling and Simulation 19(2), 159– 171 (2022)
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.