
arxiv: 2602.24138 · v2 · pith:DGRYKRXTnew · submitted 2026-02-27 · 💻 cs.CV · cs.AI

Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

Pith reviewed 2026-05-21 11:33 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords surgical phase segmentation · optimal transport · zero-shot learning · multimodal fusion · temporal segmentation · robotic surgery · training-free methods

The pith

TASOT performs training-free surgical temporal segmentation by fusing video visuals with automatically generated text descriptions inside an unbalanced optimal transport objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TASOT as an annotation-free method for dividing surgical videos into phases and steps. It extends the Action Segmentation Optimal Transport (ASOT) formulation by generating text captions from the raw video itself, aligning those captions to frames, and combining them with visual features in one transport problem. The paper reports F1 gains of 4.5 to 33.7 over the strongest zero-shot baselines across four laparoscopic and robotic benchmark settings. A reader would care because most current systems need large labeled surgical datasets or heavy pretraining, which blocks use on new platforms and in varied clinical settings.

Core claim

TASOT extends the Action Segmentation Optimal Transport formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3 while temporal captions from a vision-language model are encoded via CLIP and aligned to frames. The method reports F1-score gains of 18.9 on Cholec80, 33.7 on AutoLaparo, 23.7 on StrasByPass70, and 4.5 on BernByPass70 over the strongest zero-shot baselines.

What carries the argument

The unified unbalanced Gromov-Wasserstein optimal transport objective that integrates DINOv3 visual features with CLIP-encoded temporally aligned text captions.
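To make "unbalanced" concrete, here is a minimal sketch of entropic unbalanced optimal transport between frames and action clusters. It is a simplification and not the paper's solver: it covers only the linear (Kantorovich) term and omits the Gromov-Wasserstein structure term entirely. The function name `unbalanced_sinkhorn` and the parameters `eps` (entropic regularisation) and `tau` (KL marginal relaxation) are illustrative assumptions, as is the choice to relax both marginals.

```python
import numpy as np

def unbalanced_sinkhorn(cost, a, b, eps=0.1, tau=1.0, n_iter=200):
    """Entropic unbalanced OT with KL-relaxed marginals (linear term only).

    cost: (n_frames, n_clusters) cost matrix, assumed already fused.
    a, b: target marginals over frames and clusters.
    tau:  marginal relaxation strength; a large tau approaches balanced OT.
    """
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = tau / (tau + eps)  # exponent < 1 softens the marginal constraints
    for _ in range(n_iter):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    # Transport plan: rows are frames, columns are clusters.
    return u[:, None] * kernel * v[None, :]

# Toy example: 6 frames, 2 clusters, first three frames cheap for cluster 0.
cost = np.array([[0, 1], [0, 1], [0, 1], [1, 0], [1, 0], [1, 0]], dtype=float)
plan = unbalanced_sinkhorn(cost, np.full(6, 1 / 6), np.full(2, 0.5))
labels = plan.argmax(axis=1)  # hard segmentation read off the plan
```

Because the marginals are only softly enforced, a cluster can receive less mass than its nominal share, which is the property that lets a segmentation skip phases that do not occur in a given video.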

If this is right

  • Surgical workflow segmentation becomes feasible on new robotic platforms without collecting or annotating task-specific videos.
  • Intraoperative decision support and automation can rely on raw video input alone.
  • Skill assessment and workflow analysis scale across laparoscopic and robotic procedures without domain-specific pretraining.
  • The same multimodal transport setup can be applied to other temporal segmentation benchmarks that currently require labeled data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If caption quality drives the gains, swapping in stronger vision-language models should produce further measurable lifts on the same datasets.
  • The frame-level alignment step could be tested for extension to finer-grained surgical step recognition rather than phase-level labels.
  • Similar text-augmented transport might transfer to non-medical video domains such as sports action segmentation or surveillance event detection.

Load-bearing premise

The textual descriptions generated from the input video by a vision-language model supply accurate complementary semantic information without introducing substantial misalignment or hallucination errors.

What would settle it

A controlled ablation that removes the text component from the transport cost and measures whether the reported F1 gains disappear or reverse on the same datasets.
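A rough sketch of how such an ablation could be scored follows. It assumes frame-level cluster predictions from two runs (fused and visual-only), maps cluster IDs to ground-truth labels by exhaustive permutation search, and compares macro F1. The function names are hypothetical, and the paper's exact F1 protocol is not stated on this page, so the metric here is an assumption.

```python
from itertools import permutations

def macro_f1(pred, gt, num_classes):
    """Frame-wise macro F1 over classes 0..num_classes-1."""
    scores = []
    for c in range(num_classes):
        tp = sum(p == c and g == c for p, g in zip(pred, gt))
        fp = sum(p == c and g != c for p, g in zip(pred, gt))
        fn = sum(p != c and g == c for p, g in zip(pred, gt))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

def best_matched_f1(pred_clusters, gt, num_classes):
    """Best macro F1 over all one-to-one cluster-to-label mappings.

    Brute force is fine for a handful of phases; a Hungarian solver
    would be the usual choice for larger label sets.
    """
    best = 0.0
    for perm in permutations(range(num_classes)):
        mapped = [perm[p] for p in pred_clusters]
        best = max(best, macro_f1(mapped, gt, num_classes))
    return best

def text_ablation_delta(pred_fused, pred_visual_only, gt, num_classes):
    """Positive values mean the text component helped on this video."""
    return (best_matched_f1(pred_fused, gt, num_classes)
            - best_matched_f1(pred_visual_only, gt, num_classes))

gt = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_fused = [2, 2, 2, 0, 0, 0, 1, 1, 1]        # perfect up to relabelling
pred_visual_only = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # merges two phases
delta = text_ablation_delta(pred_fused, pred_visual_only, gt, 3)
```

If the delta stays near zero or turns negative across the four benchmark settings, the reported gains would have to come from somewhere other than the text modality.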

Figures

Figures reproduced from arXiv: 2602.24138 by Ayah Al-Naji, Cesare Stefanini, Edoardo Fazzari, Hamdan Alhadhrami, Ivan Laptev, Khalfan Hableel, Omar Mohamed, Saif AlKindi.

Figure 1: Overview of TASOT. Surgical videos are divided into temporal windows and processed by a vision–language model to generate structured temporal captions. Visual features (DINOv3) and temporally aligned textual features (CLIP) are integrated within the TASOT model, where a weighted multimodal cost is used for unsupervised surgical temporal segmentation.
Figure 2: Qualitative results. We display the ground-truth (GT) and the results of TASOT (Ours) from the MultiByPass140 dataset.
Excerpt from the surrounding text: "[partic]ularly for step-level segmentation. Specifically, performance on BernBypass140 increases from 23.0 to 48.8, surpassing even the supervised baseline, while results on StrasBypass140 improve from 30.7 to 52.4, nearly closing the gap with the unsupervised upper bound."
Original abstract

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.
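The abstract says that captions are "temporally aligned to individual frames" but this page does not describe the mechanism. One simple reading, sketched below under that assumption, is that each caption covers a half-open frame interval and its text embedding is copied to every frame inside it. The function name, the interval format, and the zero fill for uncovered frames are all hypothetical.

```python
import numpy as np

def align_captions_to_frames(window_embs, window_bounds, num_frames):
    """Broadcast one embedding per caption window to per-frame features.

    window_embs:   (n_windows, dim) caption embeddings, e.g. from a text encoder.
    window_bounds: list of (start, end) frame indices, end exclusive.
    Frames not covered by any window keep a zero vector.
    """
    dim = window_embs.shape[1]
    frame_embs = np.zeros((num_frames, dim))
    for emb, (start, end) in zip(window_embs, window_bounds):
        frame_embs[start:end] = emb
    return frame_embs

# Two captions over a 5-frame clip.
embs = np.array([[1.0, 0.0], [0.0, 1.0]])
frames = align_captions_to_frames(embs, [(0, 3), (3, 5)], num_frames=5)
```

Under this reading, every frame in a window shares the same text feature, so any caption boundary that is off by a few seconds is inherited directly by the transport cost. That is the misalignment risk the referee report raises.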

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes TASOT, an annotation-free extension of the Action Segmentation Optimal Transport (ASOT) framework for surgical temporal segmentation. It augments the unbalanced Gromov-Wasserstein objective with temporally aligned textual captions, which a vision-language model generates from the input video and CLIP encodes, and fuses them with DINOv3 visual features. It reports large F1 gains over zero-shot baselines on Cholec80 (+18.9), AutoLaparo (+33.7), StrasByPass70 (+23.7), and BernByPass70 (+4.5).

Significance. If the central claim holds, the work would offer a meaningful advance toward practical, training-free surgical workflow analysis across robotic platforms by leveraging public datasets and avoiding domain-specific pretraining. The multimodal extension of ASOT and the breadth of evaluated benchmarks (laparoscopic and robotic procedures) are positive aspects that would strengthen the contribution if the text modality's reliability is substantiated.

major comments (3)
  1. [§3] §3 (Method): The description of the unified unbalanced Gromov-Wasserstein objective and the visual-text fusion mechanism lacks the explicit cost-matrix formulation and the precise role of the visual-text fusion weight; without these equations it is impossible to verify that the reported gains arise from complementary semantic structure rather than from an additional tunable hyperparameter or from the base ASOT component alone.
  2. [§4] §4 (Experiments) and associated result tables: The main quantitative tables report substantial F1 improvements, yet no ablation isolating the text modality, no quantitative metrics on VLM caption accuracy or temporal alignment error relative to ground-truth phases, and no transport-plan comparison with versus without the CLIP component are provided; these omissions directly undermine confidence that the +18.9–33.7 F1 gains are attributable to reliable multimodal fusion in the surgical domain.
  3. [§4.2] §4.2 or §5 (Discussion): The load-bearing assumption that VLM-generated captions supply domain-appropriate semantic cues without systematic hallucination or phase misalignment is stated but not tested; given that surgical actions and terminology lie far outside typical VLM training distributions, the absence of any error analysis or sensitivity study on caption quality leaves the central claim vulnerable.
minor comments (2)
  1. [Abstract] Abstract: The claim of evaluation on 'three public surgical datasets and four benchmark settings' is inconsistent with the four datasets explicitly named; clarify the exact dataset count and benchmark definitions.
  2. [Notation] Notation: Ensure that symbols for the unbalanced Gromov-Wasserstein distance, the fusion weight, and the temporal alignment operator are defined once and used consistently between the method equations and the experimental description.
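Major comment 2 asks for a temporal alignment error relative to ground-truth phases. One way such a metric could be defined is sketched here, as an assumption rather than anything the paper specifies: take the frame indices where captions change, and measure the mean distance from each to the nearest ground-truth phase boundary. Both function names are illustrative.

```python
def phase_boundaries(labels):
    """Frame indices at which the label differs from the previous frame."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

def mean_boundary_error(pred_boundaries, gt_boundaries):
    """Mean distance (in frames) from each predicted boundary to the nearest true one."""
    if not pred_boundaries or not gt_boundaries:
        raise ValueError("both boundary lists must be non-empty")
    total = sum(min(abs(p - g) for g in gt_boundaries) for p in pred_boundaries)
    return total / len(pred_boundaries)

gt_labels = [0, 0, 1, 1, 1, 2]
gt_bounds = phase_boundaries(gt_labels)          # [2, 5]
error = mean_boundary_error([3, 5], gt_bounds)   # 0.5 frames
```

This one-directional version does not penalise missed ground-truth boundaries; a symmetric variant would average the error in both directions.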

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which highlight important aspects for strengthening the clarity and validation of our TASOT framework. We address each major comment point by point below and will revise the manuscript accordingly to incorporate the requested details and analyses.

  1. Referee: [§3] §3 (Method): The description of the unified unbalanced Gromov-Wasserstein objective and the visual-text fusion mechanism lacks the explicit cost-matrix formulation and the precise role of the visual-text fusion weight; without these equations it is impossible to verify that the reported gains arise from complementary semantic structure rather than from an additional tunable hyperparameter or from the base ASOT component alone.

    Authors: We agree that the method section would benefit from greater mathematical precision. The original submission described the fusion at a conceptual level but did not provide the explicit cost-matrix equations or the role of the fusion weight. In the revised manuscript we will add the full formulation of the unbalanced Gromov-Wasserstein objective, explicitly defining the fused cost matrix as a convex combination C = λ C_visual + (1-λ) C_text (where C_visual is computed from DINOv3 frame features and C_text from CLIP embeddings of temporally aligned VLM captions) and stating that λ is a fixed hyperparameter chosen via a small validation sweep. We will also include a brief sensitivity plot for λ to demonstrate that the reported gains are not an artifact of a single tuned value. revision: yes

  2. Referee: [§4] §4 (Experiments) and associated result tables: The main quantitative tables report substantial F1 improvements, yet no ablation isolating the text modality, no quantitative metrics on VLM caption accuracy or temporal alignment error relative to ground-truth phases, and no transport-plan comparison with versus without the CLIP component are provided; these omissions directly undermine confidence that the +18.9–33.7 F1 gains are attributable to reliable multimodal fusion in the surgical domain.

    Authors: We acknowledge that the current experimental section does not contain the requested ablations and supporting analyses. In the revision we will add (i) a direct ablation of TASOT versus the original ASOT baseline (i.e., with and without the text modality), (ii) quantitative caption-quality metrics (e.g., temporal alignment error measured against ground-truth phase boundaries and semantic similarity scores where reference descriptions exist), and (iii) side-by-side visualizations or quantitative comparisons of the learned transport plans with and without the CLIP component. These additions will allow readers to attribute the F1 gains specifically to the multimodal fusion. revision: yes

  3. Referee: [§4.2] §4.2 or §5 (Discussion): The load-bearing assumption that VLM-generated captions supply domain-appropriate semantic cues without systematic hallucination or phase misalignment is stated but not tested; given that surgical actions and terminology lie far outside typical VLM training distributions, the absence of any error analysis or sensitivity study on caption quality leaves the central claim vulnerable.

    Authors: We recognize the importance of directly validating the reliability of VLM captions in the surgical domain. While the consistent gains across four benchmarks provide indirect support, we agree that explicit error analysis is warranted. In the revised discussion we will add a sensitivity study that varies the VLM (and prompt) and reports observed caption error rates and phase misalignment statistics. We will also discuss the limitations arising from potential hallucinations and how the unbalanced transport formulation partially mitigates them. revision: yes
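The convex combination C = λ C_visual + (1-λ) C_text from response 1 and the promised λ sensitivity sweep can be sketched as follows. This is only an illustration under stated assumptions: both cost matrices are taken as given with shape (frames, clusters), and the sweep uses a per-frame argmin as a crude stand-in for the full transport solve. The names `fused_cost` and `lambda_sweep` are hypothetical.

```python
import numpy as np

def fused_cost(c_visual, c_text, lam):
    """Convex combination of two cost matrices; lam=1 is visual-only."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * c_visual + (1.0 - lam) * c_text

def lambda_sweep(c_visual, c_text, lams):
    """Per-frame argmin labels for each lam (a proxy, not the OT solution)."""
    return {lam: fused_cost(c_visual, c_text, lam).argmin(axis=1).tolist()
            for lam in lams}

c_visual = np.array([[0.0, 1.0], [1.0, 0.0]])
c_text = np.array([[1.0, 0.0], [1.0, 0.0]])
sweep = lambda_sweep(c_visual, c_text, [0.0, 0.25, 1.0])
```

In this toy case the first frame flips cluster as λ moves from 1 toward 0, which is the kind of sensitivity a reader would want to see plotted against F1 on the real benchmarks.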

Circularity Check

0 steps flagged

No circularity: extension of prior ASOT uses external pre-trained models and reports results on public benchmarks

full rationale

The paper defines TASOT by extending the existing ASOT formulation with multimodal inputs from DINOv3 and CLIP (pre-trained external models) inside an unbalanced Gromov-Wasserstein objective, then evaluates the resulting method on independent public datasets (Cholec80, AutoLaparo, StrasByPass70, BernByPass70) against zero-shot baselines. No equations or steps reduce by construction to fitted parameters, self-definitions, or unverified self-citations; the performance deltas are measured outcomes rather than tautological outputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on pre-trained DINOv3 and CLIP models treated as fixed feature extractors and on the prior ASOT formulation. No new physical entities are postulated. A likely free parameter is the relative weighting of visual versus textual costs inside the transport objective.

free parameters (1)
  • visual-text fusion weight
    Hyperparameter balancing the contribution of CLIP text embeddings against DINOv3 visual features in the unified transport cost; inferred from the multimodal fusion description.
axioms (1)
  • domain assumption: Temporally aligned captions generated by a vision-language model supply reliable complementary semantic structure for surgical phase matching
    Invoked when the paper states that textual cues provide complementary semantic structure to the transport cost.

pith-pipeline@v0.9.0 · 5826 in / 1541 out tokens · 64113 ms · 2026-05-21T11:33:27.294084+00:00 · methodology
