Fine-Grained Action Segmentation for Renorrhaphy in Robot-Assisted Partial Nephrectomy

Huanrong Liu; Jiaheng Dai; Qingbiao Li; Qin Liu; Tailai Zhou; Tongyu Jia; Xin Ma; Yu Gao; Yutong Ban; Zeju Li

arXiv: 2604.09051 · v1 · submitted 2026-04-10 · cs.CV · cs.RO

Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3

keywords: fine-grained action segmentation · renorrhaphy · robot-assisted partial nephrectomy · temporal action segmentation · surgical video analysis · SIA-RAPN benchmark · DiffAct · I3D features

The pith

A benchmark of 50 videos shows DiffAct leads most metrics while MS-TCN++ leads balanced accuracy for segmenting 12 renorrhaphy actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new benchmark problem for frame-by-frame recognition of visually similar suturing gestures that vary in duration and frequency during the renorrhaphy phase of robot-assisted partial nephrectomy. It assembles 50 clinical videos recorded with the da Vinci Xi system and supplies 12 consistent action labels per frame to create the SIA-RAPN dataset. Four temporal segmentation networks built on I3D features are then trained and tested across five released data splits, with additional cross-domain evaluation on a separate single-port RAPN collection. The comparison uses balanced accuracy, edit score, segmental F1 at three overlap thresholds, frame-wise accuracy, and frame mAP to quantify how well each model copes with class imbalance and variable gesture lengths. Establishing this benchmark supplies a concrete testbed for computational methods that could later support surgical training review or intra-operative guidance.

Core claim

On the SIA-RAPN benchmark, DiffAct records the highest segmental F1, frame-wise accuracy, edit score, and frame mAP when each metric is taken at its strongest reported value over the five splits, while MS-TCN++ records the highest balanced accuracy. The same four models, all using I3D features, are further evaluated on an independent single-port RAPN dataset to measure cross-domain behavior.

What carries the argument

The SIA-RAPN benchmark of 50 da Vinci Xi videos annotated at the frame level with 12 renorrhaphy action classes, together with the comparative evaluation of the temporal models MS-TCN++, AsFormer, TUT, and DiffAct.

If this is right

DiffAct supplies the strongest segmentation performance on the defined renorrhaphy actions under the reported metrics.
MS-TCN++ supplies the best balanced accuracy when class imbalance is the dominant concern.
The benchmark and its five splits enable direct comparison of any future temporal model on this clinical task.
Cross-domain results on single-port RAPN videos provide an initial measure of how well the learned representations transfer to a different procedural variant.
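
Why frame-wise accuracy and balanced accuracy can crown different models is easiest to see on a toy case. The sketch below is illustrative only: the function names and the ten-frame label sequence are assumptions made up for this example, not code or data from the paper. Balanced accuracy is taken here as the mean of per-class recall, which is the usual definition.

```python
from collections import Counter


def frame_accuracy(y_true, y_pred):
    """Fraction of frames whose predicted label equals the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that occur in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)


# Invented toy sequence: 8 frames of a frequent gesture (0), 2 of a rare one (1).
y_true = [0] * 8 + [1] * 2
# A predictor that never emits the rare gesture.
y_pred = [0] * 10

print(frame_accuracy(y_true, y_pred))     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A model that ignores the rare class still scores 0.8 on frame-wise accuracy but only 0.5 on balanced accuracy, which is why the two metrics need not agree on a ranking when gesture frequencies are skewed.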

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If segmentation accuracy continues to improve, the outputs could be used to generate automated post-case summaries for surgeon training.
Adding robot kinematic streams or tool-pose tracks to the video features might reduce confusion between visually similar gestures.
Scaling the benchmark to hundreds of cases from multiple centers would test whether current performance gaps persist under greater clinical variability.

Load-bearing premise

The 12 frame-level action annotations are consistent, complete, and free of selection bias across all videos and the five released splits.

What would settle it

Independent re-annotation of the same 50 videos by a new set of surgeons, followed by re-training and re-evaluation on the same five splits, produces a different ranking in which DiffAct no longer leads on F1 or edit score.

Original abstract

Fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy requires frame-level recognition of visually similar suturing gestures with variable duration and substantial class imbalance. The SIA-RAPN benchmark defines this problem on 50 clinical videos acquired with the da Vinci Xi system and annotated with 12 frame-level labels. The benchmark compares four temporal models built on I3D features: MS-TCN++, AsFormer, TUT, and DiffAct. Evaluation uses balanced accuracy, edit score, segmental F1 at overlap thresholds of 10, 25, and 50, frame-wise accuracy, and frame-wise mean average precision. In addition to the primary evaluation across five released split configurations on SIA-RAPN, the benchmark reports cross-domain results on a separate single-port RAPN dataset. Across the strongest reported values over those five runs on the primary dataset, DiffAct achieves the highest F1, frame-wise accuracy, edit score, and frame mAP, while MS-TCN++ attains the highest balanced accuracy.
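
The segment-level metrics named in the abstract can be sketched as follows. This is a minimal version modeled on the way edit score and segmental F1 are commonly implemented in public temporal action segmentation code; it is an assumption that the paper's evaluation follows the same conventions (for example, that the thresholds 10, 25, and 50 mean IoU of 0.10, 0.25, and 0.50). The function names and the toy label sequences are invented for illustration.

```python
def get_segments(labels):
    """Collapse frame labels into (label, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def edit_score(y_true, y_pred):
    """100 * (1 - normalized Levenshtein distance) between segment label orders."""
    g = [s[0] for s in get_segments(y_true)]
    p = [s[0] for s in get_segments(y_pred)]
    if not g and not p:
        return 100.0
    prev = list(range(len(g) + 1))
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(g)
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 100.0 * (1 - prev[-1] / max(len(p), len(g)))


def segmental_f1(y_true, y_pred, overlap):
    """F1 over segments; a prediction is a hit if its IoU with an unmatched
    ground-truth segment of the same label reaches `overlap`."""
    gt = get_segments(y_true)
    pred = get_segments(y_pred)
    matched = [False] * len(gt)
    tp = fp = 0
    for label, ps, pe in pred:
        best_iou, best_idx = 0.0, -1
        for idx, (g_label, gs, ge) in enumerate(gt):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= overlap and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Invented toy sequences: one mislabeled frame splits a gesture in two.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 2, 2]

for k in (0.10, 0.25, 0.50):
    print(f"F1@{int(k * 100)}:", segmental_f1(y_true, y_pred, k))
print("Edit:", edit_score(y_true, y_pred))
```

In this toy case nine of ten frames are correct, yet the single wrong frame creates two spurious segments, so the edit score drops to 60 and F1@50 to 75. That sensitivity to over-segmentation is what the segment-level metrics add on top of frame-wise accuracy.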

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper mainly adds a new 50-video dataset for fine-grained renorrhaphy gesture segmentation in robotic partial nephrectomy and runs standard temporal models on it.

The core contribution is the SIA-RAPN benchmark: 50 da Vinci Xi videos labeled at frame level with 12 specific renorrhaphy actions, plus a comparison of MS-TCN++, AsFormer, TUT, and DiffAct using I3D features. DiffAct comes out ahead on F1, edit score, frame accuracy, and mAP across the five splits, while MS-TCN++ leads on balanced accuracy. They also test cross-domain on a single-port set. That dataset release is the actual new piece; the models and metrics are off-the-shelf.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the SIA-RAPN benchmark for fine-grained action segmentation during renorrhaphy in robot-assisted partial nephrectomy. It consists of 50 da Vinci Xi clinical videos annotated at the frame level with 12 suturing gesture labels. Four temporal models (MS-TCN++, AsFormer, TUT, DiffAct) are evaluated on I3D features using balanced accuracy, edit score, segmental F1@10/25/50, frame-wise accuracy, and frame mAP. The central empirical claim is that, taking the strongest reported value of each metric over the five released splits, DiffAct attains the highest F1, frame-wise accuracy, edit score, and frame mAP while MS-TCN++ leads in balanced accuracy; cross-domain results on a separate single-port RAPN dataset are also presented.

Significance. If the ground-truth annotations prove reliable and the five splits adequately capture clinical variability, the benchmark would supply a useful public resource for fine-grained surgical gesture recognition, a domain where class imbalance and variable action durations remain challenging. The head-to-head comparison of recent temporal models and the inclusion of cross-domain evaluation constitute concrete empirical contributions that could guide future work in computer-assisted surgery.

major comments (3)

§3.1 (Dataset and Annotation): No inter-rater agreement statistics, annotation protocol, or quantification of label noise are reported for the 12 fine-grained labels. Given that the central ranking of DiffAct versus MS-TCN++ rests on frame-level ground truth for visually similar gestures, the absence of these details makes it impossible to judge whether the reported metric differences reflect modeling advances or annotation artifacts.
§4.2 (Experimental Setup and Results): The manuscript reports only the strongest value across the five splits for each metric without mean and standard deviation. Because the claim that DiffAct is superior on four of five metrics depends on this ordering being stable, the lack of variability statistics prevents assessment of whether the ranking is robust or an artifact of particular data partitions.
§4.3 (Implementation Details): No information is given on hyperparameter selection, training schedules, or explicit handling of class imbalance for any of the four models. These choices are load-bearing for reproducing the comparative results and for determining whether DiffAct’s reported advantages arise from architectural merits or from favorable tuning on the SIA-RAPN splits.
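
The concern about best-of-five reporting can be made concrete with a small sketch. Everything in it is hypothetical: `model_a`, `model_b`, and the per-split scores are invented placeholders chosen to show the effect and are not results from the paper. The standard deviation is the sample standard deviation from Python's `statistics` module.

```python
from statistics import mean, stdev

# Invented per-split scores for two placeholder models (NOT from the paper).
scores = {
    "model_a": [70.0, 62.0, 61.0, 63.0, 60.0],  # one lucky split
    "model_b": [68.0, 67.0, 66.0, 68.0, 67.0],  # consistently good
}


def summarize(values):
    """Best value plus mean and sample standard deviation over the splits."""
    return {"best": max(values), "mean": mean(values), "std": stdev(values)}


summary = {name: summarize(vals) for name, vals in scores.items()}

# Rank once by the best split and once by the mean over splits.
best_ranked = max(summary, key=lambda name: summary[name]["best"])
mean_ranked = max(summary, key=lambda name: summary[name]["mean"])

for name, stats in summary.items():
    print(f"{name}: best={stats['best']:.1f} "
          f"mean={stats['mean']:.1f} std={stats['std']:.1f}")
print("winner by best split:", best_ranked)
print("winner by mean:", mean_ranked)
```

With these made-up numbers the winner flips from `model_a` (best split) to `model_b` (mean), which is the kind of instability that mean and standard deviation over the five splits would expose or rule out.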

minor comments (2)

[Abstract] The abstract would benefit from explicitly stating the total number of videos (50) and labels (12) to give readers an immediate sense of scale.
[Results tables] Table captions and axis labels in the results section should clarify whether reported F1 scores are segmental or frame-wise to avoid ambiguity with the separate frame mAP metric.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of reproducibility and reliability that we will address in the revision. Below we respond point by point to each major comment.

Referee: §3.1 (Dataset and Annotation): No inter-rater agreement statistics, annotation protocol, or quantification of label noise are reported for the 12 fine-grained labels. Given that the central ranking of DiffAct versus MS-TCN++ rests on frame-level ground truth for visually similar gestures, the absence of these details makes it impossible to judge whether the reported metric differences reflect modeling advances or annotation artifacts.

Authors: We agree that annotation reliability is critical for interpreting the results. In the revised manuscript we will add a full description of the annotation protocol, including label definitions, the annotation interface, and the process followed by the expert annotator. We will also include a quantification of label noise based on the observed annotation variability. However, inter-rater agreement statistics cannot be reported because all annotations were performed by a single expert surgeon, a standard practice for fine-grained surgical gesture datasets given the required domain expertise. We will explicitly note this as a limitation.

revision: partial

Referee: §4.2 (Experimental Setup and Results): The manuscript reports only the strongest value across the five splits for each metric without mean and standard deviation. Because the claim that DiffAct is superior on four of five metrics depends on this ordering being stable, the lack of variability statistics prevents assessment of whether the ranking is robust or an artifact of particular data partitions.

Authors: We acknowledge that reporting only the best-run values limits evaluation of robustness. In the revised manuscript we will report mean and standard deviation across all five splits for every metric and model. This will allow readers to assess the stability of the observed performance ordering between DiffAct and the other methods.

revision: yes

Referee: §4.3 (Implementation Details): No information is given on hyperparameter selection, training schedules, or explicit handling of class imbalance for any of the four models. These choices are load-bearing for reproducing the comparative results and for determining whether DiffAct’s reported advantages arise from architectural merits or from favorable tuning on the SIA-RAPN splits.

Authors: We will expand the implementation details section to include the specific hyperparameter values selected for each model, the full training schedules (epochs, learning-rate schedules, and optimizers), and the explicit strategies used to handle class imbalance (weighted losses and/or sampling). These additions will support reproducibility and clarify the source of the reported performance differences.

revision: yes
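
The rebuttal mentions weighted losses without saying which weighting is used. One common choice, shown here purely as an assumed example and not as the authors' method, is inverse-frequency weighting with the n / (K * n_c) heuristic, where n is the number of frames, K the number of classes, and n_c the number of frames of class c. The function name and the toy label sequence are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(frame_labels, num_classes):
    """Per-class loss weights n / (K * n_c); classes with no frames get 0."""
    counts = Counter(frame_labels)
    total = len(frame_labels)
    weights = []
    for c in range(num_classes):
        if counts[c] == 0:
            weights.append(0.0)  # class absent from this sample
        else:
            weights.append(total / (num_classes * counts[c]))
    return weights


# Invented toy labels: class 0 is frequent, class 2 is rare.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(labels, num_classes=3)
print(weights)
```

Rare classes receive proportionally larger weights, so each class contributes roughly equally to a weighted cross-entropy loss. Whether the paper uses this scheme, a different weighting, or resampling is exactly what the promised implementation details would need to state.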

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison with no derivations or self-referential predictions

The paper presents an empirical evaluation of four off-the-shelf temporal action segmentation models (MS-TCN++, AsFormer, TUT, DiffAct) on the SIA-RAPN dataset using standard metrics (balanced accuracy, edit score, F1, frame accuracy, mAP). No first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Model rankings are direct outputs of running the models on the released splits; they do not reduce to the inputs by construction. The work is self-contained as a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are present; the paper is an empirical dataset and model comparison.

