Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition

Alejandro Granados; Guotai Wang; Jingjing Peng; Ning Zhu; Sebastien Ourselin; Xiwu Chen; Yang Liu

arxiv: 2605.16387 · v1 · pith:YE3QXNCAnew · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition

Yang Liu , Ning Zhu , Jingjing Peng , Xiwu Chen , Alejandro Granados , Guotai Wang , Sebastien Ourselin This is my paper

Pith reviewed 2026-05-20 22:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords online surgical phase recognitiontemporal stabilityerror cascadesevidence accumulationtemporal fragmentation indexsurgical video analysistemporal inferencedeep learning backbones

0 comments

The pith

Instability in online surgical phase recognition stems from early error cascades in temporal features and memoryless decisions that ignore evidence buildup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online surgical phase recognition models often reach high frame-by-frame accuracy yet produce jumpy predictions that break up the understanding of a surgery's flow. The paper traces this instability to two concrete mechanisms: an early mistake corrupts the model's ongoing temporal state and sends errors forward in a cascade, and phase changes actually require accumulating evidence over time while most systems decide on every frame independently. To counter both issues the authors build a single Train-Inference-Evaluation framework that adds a Temporal Error-Cascade loss during training to keep feature states stable, an Evidence-Gated Transition Predictor at inference time that permits a phase change only after enough evidence has collected, and a Temporal Fragmentation Index that measures the resulting reliability. When the components are attached to three different backbones and tested on the Cholec80 and AutoLaparo video sets, temporal fragmentation drops sharply while ordinary accuracy stays the same or improves slightly.

Core claim

The paper shows that observed fragmentation in online surgical phase recognition is produced by two linked mechanisms: early misclassifications that corrupt temporal feature states and then propagate forward as error cascades, plus a mismatch in which phase transitions follow evidence-accumulation rules while the systems themselves make memoryless frame-wise calls. It therefore introduces a unified Train-Inference-Evaluation framework whose training stage uses the Temporal Error-Cascade loss to suppress error onset and stabilize feature evolution, whose inference stage uses the Evidence-Gated Transition Predictor to allow state changes only when accumulated evidence exceeds a confidence gate

What carries the argument

The TEC loss that stabilizes temporal feature evolution during training together with the EGTP that gates phase transitions on accumulated evidence at inference time.

Load-bearing premise

That the two identified mechanisms dominate the observed fragmentation and that the TEC loss and EGTP can be inserted into existing backbones without lowering core accuracy or creating fresh instabilities.

What would settle it

An experiment in which the TEC loss and EGTP are added yet temporal fragmentation on Cholec80 remains high or frame-wise accuracy falls by more than a few percent.

Figures

Figures reproduced from arXiv: 2605.16387 by Alejandro Granados, Guotai Wang, Jingjing Peng, Ning Zhu, Sebastien Ourselin, Xiwu Chen, Yang Liu.

**Figure 2.** Figure 2: (a) Comparison of prediction visualization results w/ or w/o EGTP. (b) [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Relationship between accuracy/stability versus [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Online Surgical Phase Recognition (SPR) models can reach high frame-wise accuracy, yet their predictions often lack temporal stability, fragmenting workflow understanding and reducing the reliability of downstream assistance. We show that this instability is not random noise but arises from two mechanisms: early misclassifications corrupt temporal feature states and propagate forward to form error cascades, and phase transitions follow evidence-accumulation dynamics whereas most online SPR systems rely on memoryless frame-wise decisions, making them sensitive to transient confidence fluctuations. We propose a unified Train-Inference-Evaluation framework that explicitly stabilizes temporal inference dynamics using model-agnostic, plug-and-play components. For training, the Temporal Error-Cascade (TEC) loss suppresses error onset and mitigates forward error propagation by stabilizing temporal feature evolution. For inference, the Evidence-Gated Transition Predictor (EGTP) enforces evidence-driven state transitions, allowing phase changes only when accumulated evidence exceeds a confidence boundary. For evaluation, we introduce the Temporal Fragmentation Index (TFI), a reliability-aware metric that quantifies instability-induced temporal disagreement beyond conventional frame-wise and token-based measures. Experiments on Cholec80 and AutoLaparo across three representative backbones show that the proposed framework substantially improves temporal stability and reduces prediction fragmentation, while maintaining or modestly improving frame-wise performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a practical plug-and-play framework to cut fragmentation in online surgical phase recognition by targeting error cascades and evidence accumulation, but the experiments do not isolate those mechanisms from generic temporal smoothing.

read the letter

The one thing to take away is that this paper offers a practical framework to make online surgical phase recognition more stable over time. It identifies error cascades from early mistakes and memoryless decisions as the sources of fragmentation, then adds a training loss, an inference gate, and a new metric to address them. What is new is the specific combination: Temporal Error-Cascade loss to limit propagation during training, Evidence-Gated Transition Predictor to enforce evidence-based changes at test time, and Temporal Fragmentation Index to measure the instability. These are designed to be added to existing models without much change. The experiments across Cholec80 and AutoLaparo with multiple backbones show reduced fragmentation while keeping or slightly boosting frame accuracy. This is solid for the subfield because it focuses on a real reliability issue in surgical AI and provides ready-to-use pieces. The idea that phase transitions need accumulated evidence rather than per-frame calls makes sense and matches how these systems are used. The soft spots are around the strength of the causal claims. The stress test is fair here: the paper does not include direct tests like injecting early errors to track cascade length before and after the loss, or comparing the gate to generic smoothing. The gains might come from general temporal regularization instead of the targeted fixes. If the full paper has more ablations that address this, that would help a lot. Otherwise it remains a question mark on how necessary the exact components are. Readers working on video analysis for medicine or real-time decision support would find this worth reading. It is the sort of incremental but useful work that refines existing approaches. It deserves a serious referee to check the details and suggest ways to strengthen the evidence for the mechanisms. I recommend putting it through peer review. The contribution is clear enough to warrant that step.

Referee Report

2 major / 2 minor

Summary. The paper claims that temporal instability in online surgical phase recognition arises from two specific mechanisms—early misclassifications corrupting temporal feature states and propagating as error cascades, plus phase transitions following evidence-accumulation dynamics while most systems use memoryless frame-wise decisions. It introduces a unified Train-Inference-Evaluation framework with the Temporal Error-Cascade (TEC) loss to stabilize temporal feature evolution during training, the Evidence-Gated Transition Predictor (EGTP) to enforce evidence-driven state transitions at inference, and the Temporal Fragmentation Index (TFI) as a new reliability-aware evaluation metric. Experiments on Cholec80 and AutoLaparo across three backbones report substantially improved temporal stability and reduced fragmentation while maintaining or modestly improving frame-wise accuracy.

Significance. If the identified mechanisms prove to be the dominant drivers and the proposed components specifically target them in a model-agnostic manner, the work could meaningfully improve the reliability of real-time surgical workflow assistance. The plug-and-play design and introduction of TFI address practical gaps in existing online SPR pipelines. The significance hinges on whether the gains are mechanistically attributable to the claimed dynamics rather than generic temporal regularization.

major comments (2)

[Experiments] Experiments section: the reported comparisons evaluate only the full TEC+EGTP+TFI system against baselines, without controlled interventions such as injecting early misclassifications and measuring cascade length or propagation distance before versus after TEC. This leaves open whether the stability gains arise from the hypothesized error-cascade suppression or from any form of temporal regularization.
[Inference component] Inference and evaluation: no ablation compares EGTP against a generic persistence or smoothing filter that also discourages transient flips. Without this, it is unclear whether the evidence-accumulation formulation is necessary or whether simpler temporal constraints would yield equivalent fragmentation reduction, weakening the claim that memoryless decisions are the core issue.

minor comments (2)

[Abstract] Abstract: the three representative backbones are not named; specifying them (e.g., in the first paragraph of the experiments) would improve immediate readability.
[Evaluation] The exact mathematical definition of the TFI and its relationship to existing token-based or frame-wise metrics should be stated explicitly, ideally with a short derivation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which highlight important aspects of mechanistic validation for our proposed framework. We provide point-by-point responses below and commit to targeted revisions that strengthen the evidence for the claimed dynamics without altering the core claims of the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: the reported comparisons evaluate only the full TEC+EGTP+TFI system against baselines, without controlled interventions such as injecting early misclassifications and measuring cascade length or propagation distance before versus after TEC. This leaves open whether the stability gains arise from the hypothesized error-cascade suppression or from any form of temporal regularization.

Authors: We agree that direct controlled interventions with injected early errors would offer stronger causal support for the error-cascade mechanism. Our existing ablations isolate the contribution of the TEC loss by comparing models trained with and without it, showing consistent reductions in fragmentation metrics that align with suppressed propagation. To address this gap explicitly, we will add a new controlled experiment subsection that simulates early misclassifications at varying rates and quantifies cascade length and propagation distance with versus without TEC, using the same backbones and datasets. revision: yes
Referee: [Inference component] Inference and evaluation: no ablation compares EGTP against a generic persistence or smoothing filter that also discourages transient flips. Without this, it is unclear whether the evidence-accumulation formulation is necessary or whether simpler temporal constraints would yield equivalent fragmentation reduction, weakening the claim that memoryless decisions are the core issue.

Authors: The referee is correct that a direct comparison to generic persistence or smoothing would better isolate the necessity of the evidence-gated formulation. While the manuscript already benchmarks against multiple temporal baselines (including LSTM and transformer variants with inherent smoothing), these do not specifically test a minimal persistence filter. We will incorporate an additional ablation in the revised experiments section that directly compares EGTP against (i) a persistence filter that retains the prior phase until a new prediction exceeds a fixed threshold and (ii) a simple exponential moving average smoother, reporting TFI and fragmentation metrics to demonstrate the added benefit of the evidence-accumulation boundary. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

full rationale

The paper identifies two mechanisms of instability through empirical observation and introduces TEC loss for training, EGTP for inference, and TFI for evaluation as independent, model-agnostic additions to existing backbones. No equations, derivations, or self-citations in the abstract or framework description reduce the claimed stabilizations to fitted inputs, self-definitions, or prior author results by construction. Experiments compare the full system against baselines on Cholec80 and AutoLaparo, with gains presented as arising from the new components rather than renaming or forcing existing patterns. The central claims remain externally falsifiable via standard temporal metrics and do not collapse into the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the confidence boundary inside EGTP is mentioned but not quantified or derived.

pith-pipeline@v0.9.0 · 5771 in / 1209 out tokens · 77516 ms · 2026-05-20T22:10:55.241755+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

[1]

Nature Communications14(1), 6676 (2023)

Cao, J., Yip, H.C., Chen, Y., Scheppach, M., Luo, X., Yang, H., Cheng, M.K., Long, Y., Jin, Y., Chiu, P.W.Y., Yam, Y., Meng, H.M.L., Dou, Q.: Intelligent surgical workflow recognition for endoscopic submucosal dissection with real-time animal study. Nature Communications14(1), 6676 (2023)

work page 2023
[2]

Chen, Y., Wang, K.N., Tayupo, D., Huaulm’e, A., Timoh, K.N., Jannin, P., Dou, Q.: Dsted: Decoupling temporal stabilization and discriminative enhancement for surgical workflow recognition (2025)

work page 2025
[3]

Chen, Z., Luo, X., Wu, J., Bai, L., Lei, Z., Ren, H., Ourselin, S., Liu, H.: Surg- plan++: Universal surgical phase localization network for online and offline infer- ence (2025), https://arxiv.org/abs/2409.12467

work page arXiv 2025
[4]

In: International Congress Series

Cleary, K., Chung, H.Y., Mun, S.K.: Or2020 workshop overview: Operating room of the future. In: International Congress Series. vol. 1268, pp. 847–852. Elsevier (2004)

work page 2004
[5]

Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolu- tionalnetworks.In:MedicalImageComputingandComputerAssistedIntervention – MICCAI 2020. vol. 12263, pp. 343–352 (2020)

work page 2020
[6]

Inter- national Journal of Computer Assisted Radiology and Surgery11(6), 1081–1089 (2016)

Dergachyova, O., Bouget, D., Huaulmé, A., Morandi, X., Jannin, P.: Automatic data-driven real-time segmentation and recognition of surgical workflow. Inter- national Journal of Computer Assisted Radiology and Surgery11(6), 1081–1089 (2016)

work page 2016
[7]

Academic Medicine94(3), 427–439 (2019)

Dias, R.D., Gupta, A., Yule, S.J.: Using machine learning to assess physician com- petence: A systematic review. Academic Medicine94(3), 427–439 (2019)

work page 2019
[8]

Ding, H., Gao, Z., Planche, B., Luan, T., Sharma, A., Zheng, M., Lou, A., Chen, T., Unberath, M., Wu, Z.: Neural finite-state machines for surgical phase recognition (2025), https://arxiv.org/abs/2411.18018

work page arXiv 2025
[9]

International Journal of Computer Assisted Radiology and Surgery13, 1301–1308 (2018)

Franke, S., Rockstroh, M., Hofer, M., Neumuth, T.: The intelligent or: Design and validation of a context-aware surgical working environment. International Journal of Computer Assisted Radiology and Surgery13, 1301–1308 (2018)

work page 2018
[10]

In: Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA)

Gao, X., Jin, Y., Dou, Q., Heng, P.A.: Automatic gesture recognition in robot- assisted surgery with reinforcement learning and tree search. In: Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 8440–8446 (2020)

work page 2020
[11]

In: Medical Image Computing and Computer Assisted Intervention – MICCAI

Gao, X., Jin, Y., Long, Y., Dou, Q., Heng, P.A.: Trans-svnet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer. In: Medical Image Computing and Computer Assisted Intervention – MICCAI

work page
[12]

Annals of Surgery273(4), 684–693 (2021)

Garrow, C.R., Kowalewski, K.F., Li, L., Wagner, M., Schmidt, M.W., Engelhardt, S., Hashimoto, D.A., Kenngott, H.G., Bodenstedt, S., Speidel, S., Müller-Stich, B.P., Nickel, F.: Machine learning for surgical phase recognition: A systematic review. Annals of Surgery273(4), 684–693 (2021)

work page 2021
[13]

In: Proceedings of the 2021IEEE/CVFInternationalConferenceonComputerVision(ICCV).pp.13485– 13495

Girdhar, R., Grauman, K.: Anticipative video transformer. In: Proceedings of the 2021IEEE/CVFInternationalConferenceonComputerVision(ICCV).pp.13485– 13495. IEEE, Montreal, QC, Canada (October 2021)

work page 2021
[14]

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces (2024), https://arxiv.org/abs/2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

IEEE Transactions on Medical Imaging37(5), 1114–1126 (2018) Stabilizing Temporal Inference Dynamics for Online SPR 11

Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.W., Heng, P.A.: Sv-rcnet: Work- flow recognition from surgical videos using recurrent convolutional network. IEEE Transactions on Medical Imaging37(5), 1114–1126 (2018) Stabilizing Temporal Inference Dynamics for Online SPR 11

work page 2018
[16]

IEEE Transactions on Medical Imaging40(7), 1911–1923 (2021)

Jin, Y., Long, Y., Chen, C., Zhao, Z., Dou, Q., Heng, P.A.: Temporal memory relation network for workflow recognition from surgical video. IEEE Transactions on Medical Imaging40(7), 1911–1923 (2021)

work page 1911
[17]

Surgical Endoscopy 33, 3732–3740 (2019)

Kowalewski, K.F., Garrow, C.R., Schmidt, M.W., Benner, L., Müller-Stich, B.P., Nickel, F.: Sensor-based machine learning for workflow detection and as key to detect expert level in laparoscopic suturing and knot-tying. Surgical Endoscopy 33, 3732–3740 (2019)

work page 2019
[18]

In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1003–1012 (2017)

work page 2017
[19]

Lea, C., Reiter, A., Vidal, R., Hager, G.D.: Segmental spatiotemporal cnns for fine-grained action segmentation (2016), https://arxiv.org/abs/1602.02995

work page internal anchor Pith review Pith/arXiv arXiv 2016
[20]

In: Hua, G., Jégou, H

Lea, C., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks: A unified approach to action segmentation. In: Hua, G., Jégou, H. (eds.) Computer Vision – ECCV 2016 Workshops. pp. 47–54 (2016)

work page 2016
[21]

IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6647–6658 (2023)

Li, S., Farha, Y.A., Liu, Y., Cheng, M.M., Gall, J.: Ms-tcn++: Multi-stage tempo- ral convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6647–6658 (2023)

work page 2023
[22]

Medical Image Analysis99, 103366 (2025)

Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recog- nition. Medical Image Analysis99, 103366 (2025)

work page 2025
[23]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

Liu, Y., Huo, J., Peng, J., Sparks, R., Dasgupta, P., Granados, A., Ourselin, S.: Skit: a fast key information video transformer for online surgical phase recognition. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 21017–21027 (2023)

work page 2023
[24]

IEEE Trans- actions on Medical Imaging34(4), 877–887 (2015)

Quellec, G., Lamard, M., Cochener, B., Cazuguel, G.: Real-time task recognition in cataract surgery videos using adaptive spatiotemporal polynomials. IEEE Trans- actions on Medical Imaging34(4), 877–887 (2015)

work page 2015
[25]

IEEE Transactions on Medical Imaging36(1), 86–97 (2017)

Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging36(1), 86–97 (2017)

work page 2017
[26]

In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2022

Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y.: Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. pp. 486–496 (2022)

work page 2022
[27]

Wu, H., Wang, T.H., Lechner, M., Hasani, R., Eckhoff, J.A., Pak, P., Meireles, O.R., Rosman, G., Ban, Y., Rus, D.: Holistic surgical phase recognition with hierarchical input dependent state space models (2025), https://arxiv.org/abs/2506.21330

work page arXiv 2025
[28]

In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2024

Yang, S., Luo, L., Wang, Q., Chen, H.: Surgformer: Surgical transformer with hierarchical temporal attention for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. pp. 606–616 (2024)

work page 2024
[29]

In: Wang, L., Gall, J., Chin, T.J., Sato, I., Chel- lappa, R

Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Wang, L., Gall, J., Chin, T.J., Sato, I., Chel- lappa, R. (eds.) Computer Vision – ACCV 2022. pp. 417–432 (2023)

work page 2022

[1] [1]

Nature Communications14(1), 6676 (2023)

Cao, J., Yip, H.C., Chen, Y., Scheppach, M., Luo, X., Yang, H., Cheng, M.K., Long, Y., Jin, Y., Chiu, P.W.Y., Yam, Y., Meng, H.M.L., Dou, Q.: Intelligent surgical workflow recognition for endoscopic submucosal dissection with real-time animal study. Nature Communications14(1), 6676 (2023)

work page 2023

[2] [2]

Chen, Y., Wang, K.N., Tayupo, D., Huaulm’e, A., Timoh, K.N., Jannin, P., Dou, Q.: Dsted: Decoupling temporal stabilization and discriminative enhancement for surgical workflow recognition (2025)

work page 2025

[3] [3]

Chen, Z., Luo, X., Wu, J., Bai, L., Lei, Z., Ren, H., Ourselin, S., Liu, H.: Surg- plan++: Universal surgical phase localization network for online and offline infer- ence (2025), https://arxiv.org/abs/2409.12467

work page arXiv 2025

[4] [4]

In: International Congress Series

Cleary, K., Chung, H.Y., Mun, S.K.: Or2020 workshop overview: Operating room of the future. In: International Congress Series. vol. 1268, pp. 847–852. Elsevier (2004)

work page 2004

[5] [5]

Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolu- tionalnetworks.In:MedicalImageComputingandComputerAssistedIntervention – MICCAI 2020. vol. 12263, pp. 343–352 (2020)

work page 2020

[6] [6]

Inter- national Journal of Computer Assisted Radiology and Surgery11(6), 1081–1089 (2016)

Dergachyova, O., Bouget, D., Huaulmé, A., Morandi, X., Jannin, P.: Automatic data-driven real-time segmentation and recognition of surgical workflow. Inter- national Journal of Computer Assisted Radiology and Surgery11(6), 1081–1089 (2016)

work page 2016

[7] [7]

Academic Medicine94(3), 427–439 (2019)

Dias, R.D., Gupta, A., Yule, S.J.: Using machine learning to assess physician com- petence: A systematic review. Academic Medicine94(3), 427–439 (2019)

work page 2019

[8] [8]

Ding, H., Gao, Z., Planche, B., Luan, T., Sharma, A., Zheng, M., Lou, A., Chen, T., Unberath, M., Wu, Z.: Neural finite-state machines for surgical phase recognition (2025), https://arxiv.org/abs/2411.18018

work page arXiv 2025

[9] [9]

International Journal of Computer Assisted Radiology and Surgery13, 1301–1308 (2018)

Franke, S., Rockstroh, M., Hofer, M., Neumuth, T.: The intelligent or: Design and validation of a context-aware surgical working environment. International Journal of Computer Assisted Radiology and Surgery13, 1301–1308 (2018)

work page 2018

[10] [10]

In: Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA)

Gao, X., Jin, Y., Dou, Q., Heng, P.A.: Automatic gesture recognition in robot- assisted surgery with reinforcement learning and tree search. In: Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 8440–8446 (2020)

work page 2020

[11] [11]

In: Medical Image Computing and Computer Assisted Intervention – MICCAI

Gao, X., Jin, Y., Long, Y., Dou, Q., Heng, P.A.: Trans-svnet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer. In: Medical Image Computing and Computer Assisted Intervention – MICCAI

work page

[12] [12]

Annals of Surgery273(4), 684–693 (2021)

Garrow, C.R., Kowalewski, K.F., Li, L., Wagner, M., Schmidt, M.W., Engelhardt, S., Hashimoto, D.A., Kenngott, H.G., Bodenstedt, S., Speidel, S., Müller-Stich, B.P., Nickel, F.: Machine learning for surgical phase recognition: A systematic review. Annals of Surgery273(4), 684–693 (2021)

work page 2021

[13] [13]

In: Proceedings of the 2021IEEE/CVFInternationalConferenceonComputerVision(ICCV).pp.13485– 13495

Girdhar, R., Grauman, K.: Anticipative video transformer. In: Proceedings of the 2021IEEE/CVFInternationalConferenceonComputerVision(ICCV).pp.13485– 13495. IEEE, Montreal, QC, Canada (October 2021)

work page 2021

[14] [14]

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces (2024), https://arxiv.org/abs/2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

IEEE Transactions on Medical Imaging37(5), 1114–1126 (2018) Stabilizing Temporal Inference Dynamics for Online SPR 11

Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.W., Heng, P.A.: Sv-rcnet: Work- flow recognition from surgical videos using recurrent convolutional network. IEEE Transactions on Medical Imaging37(5), 1114–1126 (2018) Stabilizing Temporal Inference Dynamics for Online SPR 11

work page 2018

[16] [16]

IEEE Transactions on Medical Imaging40(7), 1911–1923 (2021)

Jin, Y., Long, Y., Chen, C., Zhao, Z., Dou, Q., Heng, P.A.: Temporal memory relation network for workflow recognition from surgical video. IEEE Transactions on Medical Imaging40(7), 1911–1923 (2021)

work page 1911

[17] [17]

Surgical Endoscopy 33, 3732–3740 (2019)

Kowalewski, K.F., Garrow, C.R., Schmidt, M.W., Benner, L., Müller-Stich, B.P., Nickel, F.: Sensor-based machine learning for workflow detection and as key to detect expert level in laparoscopic suturing and knot-tying. Surgical Endoscopy 33, 3732–3740 (2019)

work page 2019

[18] [18]

In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1003–1012 (2017)

work page 2017

[19] [19]

Lea, C., Reiter, A., Vidal, R., Hager, G.D.: Segmental spatiotemporal cnns for fine-grained action segmentation (2016), https://arxiv.org/abs/1602.02995

work page internal anchor Pith review Pith/arXiv arXiv 2016

[20] [20]

In: Hua, G., Jégou, H

Lea, C., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks: A unified approach to action segmentation. In: Hua, G., Jégou, H. (eds.) Computer Vision – ECCV 2016 Workshops. pp. 47–54 (2016)

work page 2016

[21] [21]

IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6647–6658 (2023)

Li, S., Farha, Y.A., Liu, Y., Cheng, M.M., Gall, J.: Ms-tcn++: Multi-stage tempo- ral convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence45(6), 6647–6658 (2023)

work page 2023

[22] [22]

Medical Image Analysis99, 103366 (2025)

Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recog- nition. Medical Image Analysis99, 103366 (2025)

work page 2025

[23] [23]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

Liu, Y., Huo, J., Peng, J., Sparks, R., Dasgupta, P., Granados, A., Ourselin, S.: Skit: a fast key information video transformer for online surgical phase recognition. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 21017–21027 (2023)

work page 2023

[24] [24]

IEEE Trans- actions on Medical Imaging34(4), 877–887 (2015)

Quellec, G., Lamard, M., Cochener, B., Cazuguel, G.: Real-time task recognition in cataract surgery videos using adaptive spatiotemporal polynomials. IEEE Trans- actions on Medical Imaging34(4), 877–887 (2015)

work page 2015

[25] [25]

IEEE Transactions on Medical Imaging36(1), 86–97 (2017)

Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging36(1), 86–97 (2017)

work page 2017

[26] [26]

In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2022

Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y.: Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. pp. 486–496 (2022)

work page 2022

[27] [27]

Wu, H., Wang, T.H., Lechner, M., Hasani, R., Eckhoff, J.A., Pak, P., Meireles, O.R., Rosman, G., Ban, Y., Rus, D.: Holistic surgical phase recognition with hierarchical input dependent state space models (2025), https://arxiv.org/abs/2506.21330

work page arXiv 2025

[28] [28]

In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2024

Yang, S., Luo, L., Wang, Q., Chen, H.: Surgformer: Surgical transformer with hierarchical temporal attention for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. pp. 606–616 (2024)

work page 2024

[29] [29]

In: Wang, L., Gall, J., Chin, T.J., Sato, I., Chel- lappa, R

Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Wang, L., Gall, J., Chin, T.J., Sato, I., Chel- lappa, R. (eds.) Computer Vision – ACCV 2022. pp. 417–432 (2023)

work page 2022