Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

Ayah Al-Naji; Cesare Stefanini; Edoardo Fazzari; Hamdan Alhadhrami; Ivan Laptev; Khalfan Hableel; Omar Mohamed; Saif AlKindi

arxiv: 2602.24138 · v2 · pith:DGRYKRXTnew · submitted 2026-02-27 · 💻 cs.CV · cs.AI

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

Pith reviewed 2026-05-21 11:33 UTC · model grok-4.3

keywords: surgical phase segmentation · optimal transport · zero-shot learning · multimodal fusion · temporal segmentation · robotic surgery · training-free methods

The pith

TASOT performs training-free surgical temporal segmentation by fusing video visuals with automatically generated text descriptions inside an unbalanced optimal transport objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TASOT as an annotation-free method for dividing surgical videos into phases and steps. It extends prior optimal transport work by generating text captions from the raw video itself, aligning those captions to frames, and combining them with visual features in one transport problem. This produces large accuracy gains on several laparoscopic and robotic datasets compared with existing zero-shot approaches. A reader would care because most current systems need large labeled surgical datasets or heavy pretraining, which blocks use on new platforms and in varied clinical settings.

Core claim

TASOT extends the Action Segmentation Optimal Transport formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3 while temporal captions from a vision-language model are encoded via CLIP and aligned to frames. The method reports F1-score gains of 18.9 on Cholec80, 33.7 on AutoLaparo, 23.7 on StrasByPass70, and 4.5 on BernByPass70 over the strongest zero-shot baselines.

What carries the argument

The unified unbalanced Gromov-Wasserstein optimal transport objective that integrates DINOv3 visual features with CLIP-encoded temporally aligned text captions.
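As a rough illustration of how a weighted multimodal cost of this kind could be assembled, here is a minimal NumPy sketch. It is not the authors' implementation: the cosine-based cost, the action prototypes, the random stand-in features, and the names `cosine_cost` and `fused_cost` are all assumptions made for the example. It also stops at the cost matrix and does not solve the unbalanced Gromov-Wasserstein problem.

```python
import numpy as np


def cosine_cost(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Pairwise cost 1 - cosine similarity between rows of x (n, d) and y (k, d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def fused_cost(c_img: np.ndarray, c_text: np.ndarray, beta: float) -> np.ndarray:
    """Convex combination of a visual and a textual frame-to-action cost matrix."""
    if c_img.shape != c_text.shape:
        raise ValueError("cost matrices must share the shape (frames, actions)")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * c_img + (1.0 - beta) * c_text


rng = np.random.default_rng(0)
n_frames, n_actions = 6, 3

# Random stand-ins for per-frame DINOv3 features and frame-aligned CLIP caption features.
vis = rng.normal(size=(n_frames, 8))
txt = rng.normal(size=(n_frames, 5))

# Hypothetical action prototypes, one set per feature space (an assumption of this sketch).
vis_protos = rng.normal(size=(n_actions, 8))
txt_protos = rng.normal(size=(n_actions, 5))

c_img = cosine_cost(vis, vis_protos)
c_text = cosine_cost(txt, txt_protos)
cost = fused_cost(c_img, c_text, beta=0.5)  # equal weight on both modalities
```

The two modalities may have different feature dimensions, since they only meet at the level of the (frames, actions) cost matrices.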

If this is right

Surgical workflow segmentation becomes feasible on new robotic platforms without collecting or annotating task-specific videos.
Intraoperative decision support and automation can rely on raw video input alone.
Skill assessment and workflow analysis scale across laparoscopic and robotic procedures without domain-specific pretraining.
The same multimodal transport setup can be applied to other temporal segmentation benchmarks that currently require labeled data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If caption quality drives the gains, swapping in stronger vision-language models should produce further measurable lifts on the same datasets.
The frame-level alignment step could be tested for extension to finer-grained surgical step recognition rather than phase-level labels.
Similar text-augmented transport might transfer to non-medical video domains such as sports action segmentation or surveillance event detection.

Load-bearing premise

The textual descriptions generated from the input video by a vision-language model supply accurate complementary semantic information without introducing substantial misalignment or hallucination errors.

What would settle it

A controlled ablation that removes the text component from the transport cost and measures whether the reported F1 gains disappear or reverse on the same datasets.
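Written against the fused cost quoted from the paper elsewhere on this page, the ablation amounts to a limiting case of the weight. The sketch below assumes that the weight β lies in [0, 1] and that i indexes frames and k indexes actions; neither is stated explicitly in the excerpt.

```latex
% Fused cost as quoted from the paper (superscripts restored):
C_{i,k} = \beta\, C^{\mathrm{img}}_{i,k} + (1-\beta)\, C^{\mathrm{text}}_{i,k},
\qquad \beta \in [0,1] \ \text{(assumed range)}

% Text-removed ablation: the visual-only cost is recovered at beta = 1.
\beta = 1 \;\Longrightarrow\; C_{i,k} = C^{\mathrm{img}}_{i,k}

% Text-only counterpart at beta = 0.
\beta = 0 \;\Longrightarrow\; C_{i,k} = C^{\mathrm{text}}_{i,k}
```

Comparing F1 at β = 1 against the reported setting, with every other hyperparameter held fixed, would isolate the contribution of the text term.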

Figures

Figures reproduced from arXiv: 2602.24138 by Ayah Al-Naji, Cesare Stefanini, Edoardo Fazzari, Hamdan Alhadhrami, Ivan Laptev, Khalfan Hableel, Omar Mohamed, Saif AlKindi.

**Figure 1.** Overview of TASOT. Surgical videos are divided into temporal windows and processed by a vision–language model to generate structured temporal captions. Visual features (DINOv3) and temporally aligned textual features (CLIP) are integrated within the TASOT model, where a weighted multimodal cost is used for unsupervised surgical temporal segmentation.
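The window-to-frame alignment described in the Figure 1 caption could look roughly like the sketch below. It assumes each temporal window is a half-open frame range `(start, end)` with one caption embedding per window; the function name `align_captions_to_frames`, the zero-fill for uncovered frames, and the toy embeddings are illustrative choices, not details taken from the paper.

```python
import numpy as np


def align_captions_to_frames(n_frames, windows, caption_embs):
    """Broadcast one caption embedding per window to every frame in that window.

    windows: list of (start, end) frame indices, end exclusive.
    caption_embs: array-like of shape (len(windows), d).
    Frames covered by no window keep a zero vector; if windows overlap,
    the later window overwrites the earlier one.
    """
    caption_embs = np.asarray(caption_embs, dtype=float)
    frame_embs = np.zeros((n_frames, caption_embs.shape[1]))
    for (start, end), emb in zip(windows, caption_embs):
        frame_embs[max(start, 0):min(end, n_frames)] = emb
    return frame_embs


# Toy example: three windows over a 10-frame clip, with one-hot "embeddings"
# standing in for CLIP text features.
windows = [(0, 4), (4, 7), (7, 10)]
caption_embs = np.eye(3)
frame_text = align_captions_to_frames(10, windows, caption_embs)
```

The result has one row per frame, so it can be compared against per-frame visual features of the same length.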

**Figure 2.** Qualitative results. We display the ground truth (GT) and the results of TASOT (Ours) from the MultiByPass140 dataset.

Adjacent paper text captured with the caption: "Specifically, performance on BernBypass140 increases from 23.0 to 48.8, surpassing even the supervised baseline, while results on StrasBypass140 improve from 30.7 to 52.4, nearly closing the gap with the unsupervised upper bound."

Original abstract

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TASOT adds VLM captions to ASOT inside unbalanced Gromov-Wasserstein transport for annotation-free surgical phase segmentation and reports large F1 gains, but the gains rest on untested caption reliability in the surgical domain.

The core of this paper is a straightforward extension: they take the prior ASOT optimal transport setup for action segmentation and add temporally aligned text captions generated by a VLM, encoded with CLIP, then fuse them with DINOv3 visual features in one unbalanced Gromov-Wasserstein objective. The result is a training-free method that claims big lifts over zero-shot baselines across Cholec80, AutoLaparo, StrasByPass70, and BernByPass70. That multimodal addition inside the transport cost is the actual new piece, and evaluating it on several public surgical video sets is a reasonable way to test generality without labels or domain pretraining. Using standard pretrained extractors keeps the approach deployable, which matters for robotic systems where annotation is expensive. The formulation builds directly on existing OT work, so the technical grounding looks solid on paper.

The soft spot is exactly the one the stress-test flags. Surgical videos have narrow, domain-specific actions and terminology that general VLMs rarely see, so the captions could easily misalign or hallucinate phases. That would directly corrupt the fused cost matrix and could account for some or all of the reported gains. The abstract and high-level description give no ablations that remove the text modality, no direct comparison of caption accuracy against ground-truth phases, and no error analysis on the transport plans with versus without text. Without those, it is difficult to tell whether the multimodal fusion is doing the work or whether other implementation choices are driving the numbers.

The paper is aimed at people working on zero-shot or annotation-light methods for surgical workflow analysis. A reader already familiar with optimal transport or multimodal video segmentation would get the most out of it.

It is worth sending to a serious referee because the performance deltas are large enough to check and the core idea is simple to reproduce if the missing details and controls are supplied.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes TASOT, an annotation-free extension of the Action Segmentation Optimal Transport (ASOT) framework for surgical temporal segmentation. It augments the unbalanced Gromov-Wasserstein objective with temporally aligned textual captions, which a vision-language model generates from the input video and CLIP encodes, and fuses them with DINOv3 visual features. It reports large F1 gains over zero-shot baselines on Cholec80 (+18.9), AutoLaparo (+33.7), StrasByPass70 (+23.7), and BernByPass70 (+4.5).

Significance. If the central claim holds, the work would offer a meaningful advance toward practical, training-free surgical workflow analysis across robotic platforms by leveraging public datasets and avoiding domain-specific pretraining. The multimodal extension of ASOT and the breadth of evaluated benchmarks (laparoscopic and robotic procedures) are positive aspects that would strengthen the contribution if the text modality's reliability is substantiated.

major comments (3)

[§3] §3 (Method): The description of the unified unbalanced Gromov-Wasserstein objective and the visual-text fusion mechanism lacks the explicit cost-matrix formulation and the precise role of the visual-text fusion weight; without these equations it is impossible to verify that the reported gains arise from complementary semantic structure rather than from an additional tunable hyperparameter or from the base ASOT component alone.
[§4] §4 (Experiments) and associated result tables: The main quantitative tables report substantial F1 improvements, yet no ablation isolating the text modality, no quantitative metrics on VLM caption accuracy or temporal alignment error relative to ground-truth phases, and no transport-plan comparison with versus without the CLIP component are provided; these omissions directly undermine confidence that the +18.9–33.7 F1 gains are attributable to reliable multimodal fusion in the surgical domain.
[§4.2] §4.2 or §5 (Discussion): The load-bearing assumption that VLM-generated captions supply domain-appropriate semantic cues without systematic hallucination or phase misalignment is stated but not tested; given that surgical actions and terminology lie far outside typical VLM training distributions, the absence of any error analysis or sensitivity study on caption quality leaves the central claim vulnerable.

minor comments (2)

[Abstract] Abstract: The claim of evaluation on 'three public surgical datasets and four benchmark settings' is inconsistent with the four datasets explicitly named; clarify the exact dataset count and benchmark definitions.
[Notation] Notation: Ensure that symbols for the unbalanced Gromov-Wasserstein distance, the fusion weight, and the temporal alignment operator are defined once and used consistently between the method equations and the experimental description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which highlight important aspects for strengthening the clarity and validation of our TASOT framework. We address each major comment point by point below and will revise the manuscript accordingly to incorporate the requested details and analyses.

Referee: [§3] §3 (Method): The description of the unified unbalanced Gromov-Wasserstein objective and the visual-text fusion mechanism lacks the explicit cost-matrix formulation and the precise role of the visual-text fusion weight; without these equations it is impossible to verify that the reported gains arise from complementary semantic structure rather than from an additional tunable hyperparameter or from the base ASOT component alone.

Authors: We agree that the method section would benefit from greater mathematical precision. The original submission described the fusion at a conceptual level but did not provide the explicit cost-matrix equations or the role of the fusion weight. In the revised manuscript we will add the full formulation of the unbalanced Gromov-Wasserstein objective, explicitly defining the fused cost matrix as a convex combination C = λ C_visual + (1-λ) C_text (where C_visual is computed from DINOv3 frame features and C_text from CLIP embeddings of temporally aligned VLM captions) and stating that λ is a fixed hyperparameter chosen via a small validation sweep. We will also include a brief sensitivity plot for λ to demonstrate that the reported gains are not an artifact of a single tuned value. revision: yes
Referee: [§4] §4 (Experiments) and associated result tables: The main quantitative tables report substantial F1 improvements, yet no ablation isolating the text modality, no quantitative metrics on VLM caption accuracy or temporal alignment error relative to ground-truth phases, and no transport-plan comparison with versus without the CLIP component are provided; these omissions directly undermine confidence that the +18.9–33.7 F1 gains are attributable to reliable multimodal fusion in the surgical domain.

Authors: We acknowledge that the current experimental section does not contain the requested ablations and supporting analyses. In the revision we will add (i) a direct ablation of TASOT versus the original ASOT baseline (i.e., with and without the text modality), (ii) quantitative caption-quality metrics (e.g., temporal alignment error measured against ground-truth phase boundaries and semantic similarity scores where reference descriptions exist), and (iii) side-by-side visualizations or quantitative comparisons of the learned transport plans with and without the CLIP component. These additions will allow readers to attribute the F1 gains specifically to the multimodal fusion. revision: yes
Referee: [§4.2] §4.2 or §5 (Discussion): The load-bearing assumption that VLM-generated captions supply domain-appropriate semantic cues without systematic hallucination or phase misalignment is stated but not tested; given that surgical actions and terminology lie far outside typical VLM training distributions, the absence of any error analysis or sensitivity study on caption quality leaves the central claim vulnerable.

Authors: We recognize the importance of directly validating the reliability of VLM captions in the surgical domain. While the consistent gains across four benchmarks provide indirect support, we agree that explicit error analysis is warranted. In the revised discussion we will add a sensitivity study that varies the VLM (and prompt) and reports observed caption error rates and phase misalignment statistics. We will also discuss the limitations arising from potential hallucinations and how the unbalanced transport formulation partially mitigates them. revision: yes
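One way the temporal alignment error mentioned in the second response could be measured is sketched below. It assumes per-frame phase labels are available for both the caption-derived (or predicted) sequence and the ground truth, and scores each ground-truth boundary by its distance to the nearest predicted boundary. The metric definition and the names `boundaries` and `mean_boundary_error` are illustrative assumptions, not something specified by the paper or the rebuttal.

```python
import numpy as np


def boundaries(labels):
    """Indices of the first frame of every segment after the first one."""
    labels = np.asarray(labels)
    return np.flatnonzero(labels[1:] != labels[:-1]) + 1


def mean_boundary_error(pred_labels, gt_labels):
    """Mean distance, in frames, from each ground-truth boundary to the nearest predicted one."""
    gt_b = boundaries(gt_labels)
    pred_b = boundaries(pred_labels)
    if gt_b.size == 0:
        return 0.0  # nothing to localize
    if pred_b.size == 0:
        return float("inf")  # boundaries exist but none were predicted
    dists = np.abs(gt_b[:, None] - pred_b[None, :]).min(axis=1)
    return float(dists.mean())


# Toy sequences: ground-truth boundaries at frames 3 and 6, predicted at 2 and 7.
gt = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 1, 2]
err = mean_boundary_error(pred, gt)
```

Dividing the result by the video frame rate would express the error in seconds. The score is one-sided, so spurious extra predicted boundaries are not penalized and would need a separate count.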

Circularity Check

0 steps flagged

No circularity: extension of prior ASOT uses external pre-trained models and reports results on public benchmarks

The paper defines TASOT by extending the existing ASOT formulation with multimodal inputs from DINOv3 and CLIP (pre-trained external models) inside an unbalanced Gromov-Wasserstein objective, then evaluates the resulting method on independent public datasets (Cholec80, AutoLaparo, StrasByPass70, BernByPass70) against zero-shot baselines. No equations or steps reduce by construction to fitted parameters, self-definitions, or unverified self-citations; the performance deltas are measured outcomes rather than tautological outputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on pre-trained DINOv3 and CLIP models treated as fixed feature extractors and on the prior ASOT formulation. No new physical entities are postulated. A likely free parameter is the relative weighting of visual versus textual costs inside the transport objective.

free parameters (1)

visual-text fusion weight
Hyperparameter balancing the contribution of CLIP text embeddings against DINOv3 visual features in the unified transport cost; inferred from the multimodal fusion description.

axioms (1)

domain assumption: Temporally aligned captions generated by a vision-language model supply reliable complementary semantic structure for surgical phase matching
Invoked when the paper states that textual cues provide complementary semantic structure to the transport cost.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions ... within a unified unbalanced Gromov-Wasserstein optimal transport objective. ... C_{i,k} = β C^{img}_{i,k} + (1-β) C^{text}_{i,k}"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We use modest off-the-shelf encoders ... no task-specific supervised pretraining and no massive backbone architectures tailored to surgical data."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

[1] Chen, D., Shukor, M., Moutakanni, T., Chung, W., Yu, J., Kasarla, T., Bolourchi, A., LeCun, Y., Fung, P.: Vl-jepa: Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942 (2025)
[2] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International conference on medical image computing and computer-assisted intervention. pp. 343–352. Springer (2020)
[3] Ding, G., Sener, F., Yao, A.: Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2), 1011–1030 (2024). https://doi.org/10.1109/TPAMI.2023.3327284
[4] Fazzari, E., Romano, D., Falchi, F., Stefanini, C.: Artemis: animal recognition through enhanced multimodal integration system. International Journal of Machine Learning and Cybernetics 16(9), 5877–5892 (2025)
[5] Fazzari, E., Stefanini, C.: Deep reinforcement learning for surgical robotics with state and image information: A survey. Research Square (2026). https://doi.org/10.21203/rs.3.rs-8621244/v1
[6] Google DeepMind: Gemini 2.0 flash model card. Tech. rep., Google (April 2025), available at https://modelcards.withgoogle.com/assets/documents/gemini-2-flash.pdf
[7] Hashemi, N., Svendsen, M.B.S., Bjerrum, F., Rasmussen, S., Tolsgaard, M.G., Friis, M.L.: Acquisition and usage of robotic surgical data for machine learning analysis. Surgical Endoscopy 37(8), 6588–6601 (2023)
[8] Lavanchy, J.L., Ramesh, S., Dall’Alba, D., Gonzalez, C., Fiorini, P., Müller-Stich, B.P., Nett, P.C., Marescaux, J., Mutter, D., Padoy, N.: Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery. International journal of computer assisted radiology and surgery 19(11), 2249–2257 (2024)
[9] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. Medical Image Analysis 99, 103366 (2025)
[10] Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9879–9889 (2020)
[11] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[12] Ramesh, S., Dall’Alba, D., Gonzalez, C., Yu, T., Mascagni, P., Mutter, D., Marescaux, J., Fiorini, P., Padoy, N.: Multi-task temporal convolutional networks for joint recognition of surgical phases and steps in gastric bypass procedures. International journal of computer assisted radiology and surgery 16(7), 1111–1119 (2021)
[13] Sarfraz, S., Murray, N., Sharma, V., Diba, A., Van Gool, L., Stiefelhagen, R.: Temporally-weighted hierarchical clustering for unsupervised action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11225–11234 (2021)
[14] Sener, F., Yao, A.: Unsupervised learning and segmentation of complex activities from video. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8368–8376 (2018)
[15] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)
[16] Spurio, F., Bahrami, E., Francesca, G., Gall, J.: Hierarchical vector quantization for unsupervised action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 6996–7005 (2025)
[17] Swetha, S., Kuehne, H., Rawat, Y.S., Shah, M.: Unsupervised discriminative embedding for sub-action learning in complex activities. In: 2021 IEEE International Conference on Image Processing (ICIP). pp. 2588–2592. IEEE (2021)
[18] Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M.S., Love, J., et al.: Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295 (2024)
[19] Titouan, V., Flamary, R., Courty, N., Tavenard, R., Chapel, L.: Sliced gromov-wasserstein. Advances in Neural Information Processing Systems 32 (2019)
[20] Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36(1), 86–97 (2017). https://doi.org/10.1109/TMI.2016.2593957
[21] VidalMata, R.G., Scheirer, W.J., Kukleva, A., Cox, D., Kuehne, H.: Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1238–1247 (2021)
[22] Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y.: Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 486–496. Springer (2022)
[23] Xu, M., Gould, S.: Temporally consistent unbalanced optimal transport for unsupervised action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14618–14627 (June 2024)
[24] Yuan, K., Navab, N., Padoy, N., et al.: Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation. Advances in Neural Information Processing Systems 37, 122952–122983 (2024)
[25] Yuan, K., Srivastav, V., Navab, N., Padoy, N.: Hecvl: Hierarchical video-language pretraining for zero-shot surgical phase recognition. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 306–316. Springer (2024)
[26] Yuan, K., Srivastav, V., Yu, T., Lavanchy, J.L., Marescaux, J., Mascagni, P., Navab, N., Padoy, N.: Learning multi-modal representations by watching hundreds of surgical video lectures. Medical Image Analysis p. 103644 (2025)

[1] [1]

arXiv preprint arXiv:2512.10942 (2025)

Chen, D., Shukor, M., Moutakanni, T., Chung, W., Yu, J., Kasarla, T., Bolourchi, A., LeCun, Y., Fung, P.: Vl-jepa: Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942 (2025)

work page arXiv 2025

[2] [2]

In: International conference on medical image computing and computer-assisted intervention

Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convo- lutional networks. In: International conference on medical image computing and computer-assisted intervention. pp. 343–352. Springer (2020)

work page 2020

[3] [3]

IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2), 1011–1030 (2024)

Ding, G., Sener, F., Yao, A.: Temporal action segmentation: An analysis of mod- ern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2), 1011–1030 (2024). https://doi.org/10.1109/TPAMI.2023.3327284

work page doi:10.1109/tpami.2023.3327284 2024

[4] [4]

International Journal of Ma- chine Learning and Cybernetics16(9), 5877–5892 (2025)

Fazzari, E., Romano, D., Falchi, F., Stefanini, C.: Artemis: animal recognition through enhanced multimodal integration system. International Journal of Ma- chine Learning and Cybernetics16(9), 5877–5892 (2025)

work page 2025

[5] [5]

Research Square (2026)

Fazzari, E., Stefanini, C.: Deep reinforcement learning for surgical robotics with state and image information: A survey. Research Square (2026). https://doi.org/10.21203/rs.3.rs-8621244/v1

work page doi:10.21203/rs.3.rs-8621244/v1 2026

[6] [6]

Google DeepMind: Gemini 2.0 flash model card. Tech. rep., Google (April 2025), available athttps://modelcards.withgoogle.com/assets/documents/ gemini-2-flash.pdf

work page 2025

[7] [7]

Surgical Endoscopy37(8), 6588–6601 (2023)

Hashemi,N.,Svendsen,M.B.S.,Bjerrum,F.,Rasmussen,S.,Tolsgaard,M.G.,Friis, M.L.: Acquisition and usage of robotic surgical data for machine learning analysis. Surgical Endoscopy37(8), 6588–6601 (2023)

work page 2023

[8] [8]

International journal of computer assisted radiology and surgery19(11), 2249– 2257 (2024)

Lavanchy, J.L., Ramesh, S., Dall’Alba, D., Gonzalez, C., Fiorini, P., Müller-Stich, B.P., Nett, P.C., Marescaux, J., Mutter, D., Padoy, N.: Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery. International journal of computer assisted radiology and surgery19(11), 2249– 2257 (2024)

work page 2024

[9] [9]

Medical Image Analysis99, 103366 (2025)

Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recog- nition. Medical Image Analysis99, 103366 (2025)

work page 2025

[10] [10]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Miech,A.,Alayrac,J.B.,Smaira,L.,Laptev,I.,Sivic,J.,Zisserman,A.:End-to-end learning of visual representations from uncurated instructional videos. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9879–9889 (2020)

work page 2020

[11] [11]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

work page 2021

[12] Ramesh, S., Dall’Alba, D., Gonzalez, C., Yu, T., Mascagni, P., Mutter, D., Marescaux, J., Fiorini, P., Padoy, N.: Multi-task temporal convolutional networks for joint recognition of surgical phases and steps in gastric bypass procedures. International Journal of Computer Assisted Radiology and Surgery 16(7), 1111–1119 (2021)

[13] Sarfraz, S., Murray, N., Sharma, V., Diba, A., Van Gool, L., Stiefelhagen, R.: Temporally-weighted hierarchical clustering for unsupervised action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11225–11234 (2021)

[14] Sener, F., Yao, A.: Unsupervised learning and segmentation of complex activities from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8368–8376 (2018)

[15] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)

[16] Spurio, F., Bahrami, E., Francesca, G., Gall, J.: Hierarchical vector quantization for unsupervised action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 6996–7005 (2025)

[17] Swetha, S., Kuehne, H., Rawat, Y.S., Shah, M.: Unsupervised discriminative embedding for sub-action learning in complex activities. In: 2021 IEEE International Conference on Image Processing (ICIP). pp. 2588–2592. IEEE (2021)

[18] Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M.S., Love, J., et al.: Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295 (2024)

[19] Titouan, V., Flamary, R., Courty, N., Tavenard, R., Chapel, L.: Sliced Gromov-Wasserstein. Advances in Neural Information Processing Systems 32 (2019)

[20] Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36(1), 86–97 (2017). https://doi.org/10.1109/TMI.2016.2593957

[21] VidalMata, R.G., Scheirer, W.J., Kukleva, A., Cox, D., Kuehne, H.: Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1238–1247 (2021)

[22] Wang, Z., Lu, B., Long, Y., Zhong, F., Cheung, T.H., Dou, Q., Liu, Y.: AutoLaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 486–496. Springer (2022)

[23] Xu, M., Gould, S.: Temporally consistent unbalanced optimal transport for unsupervised action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14618–14627 (June 2024)

[24] Yuan, K., Navab, N., Padoy, N., et al.: Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation. Advances in Neural Information Processing Systems 37, 122952–122983 (2024)

[25] Yuan, K., Srivastav, V., Navab, N., Padoy, N.: HecVL: Hierarchical video-language pretraining for zero-shot surgical phase recognition. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 306–316. Springer (2024)

[26] Yuan, K., Srivastav, V., Yu, T., Lavanchy, J.L., Marescaux, J., Mascagni, P., Navab, N., Padoy, N.: Learning multi-modal representations by watching hundreds of surgical video lectures. Medical Image Analysis p. 103644 (2025)