Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics
Pith reviewed 2026-05-21 11:33 UTC · model grok-4.3
The pith
TASOT performs training-free surgical temporal segmentation by fusing video visuals with automatically generated text descriptions inside an unbalanced optimal transport objective.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TASOT extends the Action Segmentation Optimal Transport formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3 while temporal captions from a vision-language model are encoded via CLIP and aligned to frames. The method reports F1-score gains of 18.9 on Cholec80, 33.7 on AutoLaparo, 23.7 on StrasByPass70, and 4.5 on BernByPass70 over the strongest zero-shot baselines.
What carries the argument
The unified unbalanced Gromov-Wasserstein optimal transport objective that integrates DINOv3 visual features with CLIP-encoded temporally aligned text captions.
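As a rough illustration of how two modality-specific costs might be combined before transport, the sketch below builds a convex combination weighted by a fusion weight, called `beta` here. It is a minimal sketch, not the authors' implementation: the cosine cost, the per-modality anchors, the feature dimensions, and the helper names `cosine_cost` and `fused_cost` are all assumptions, and random arrays stand in for DINOv3 and CLIP features.

```python
import numpy as np


def cosine_cost(features, anchors):
    """Cost matrix of shape (num_frames, num_anchors): 1 - cosine similarity."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return 1.0 - f @ a.T


def fused_cost(c_img, c_text, beta):
    """Convex combination beta * c_img + (1 - beta) * c_text."""
    if c_img.shape != c_text.shape:
        raise ValueError("cost matrices must share the shape (frames, anchors)")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * c_img + (1.0 - beta) * c_text


rng = np.random.default_rng(0)
num_frames, num_anchors = 6, 3

# Random stand-ins (assumption): real inputs would be per-frame DINOv3
# features and CLIP embeddings of the caption aligned to each frame.
visual_feats = rng.normal(size=(num_frames, 16))
visual_anchors = rng.normal(size=(num_anchors, 16))  # hypothetical anchors
text_feats = rng.normal(size=(num_frames, 8))
text_anchors = rng.normal(size=(num_anchors, 8))     # hypothetical anchors

c_img = cosine_cost(visual_feats, visual_anchors)
c_text = cosine_cost(text_feats, text_anchors)
cost = fused_cost(c_img, c_text, beta=0.5)
print(cost.shape)
```

Because each modality is reduced to a frames-by-anchors cost first, the two encoders do not need to share an embedding dimension. Setting `beta=1.0` recovers the visual-only cost.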
If this is right
- Surgical workflow segmentation becomes feasible on new robotic platforms without collecting or annotating task-specific videos.
- Intraoperative decision support and automation can rely on raw video input alone.
- Skill assessment and workflow analysis scale across laparoscopic and robotic procedures without domain-specific pretraining.
- The same multimodal transport setup can be applied to other temporal segmentation benchmarks that currently require labeled data.
Where Pith is reading between the lines
- If caption quality drives the gains, swapping in stronger vision-language models should produce further measurable lifts on the same datasets.
- The frame-level alignment step could be tested for extension to finer-grained surgical step recognition rather than phase-level labels.
- Similar text-augmented transport might transfer to non-medical video domains such as sports action segmentation or surveillance event detection.
Load-bearing premise
The textual descriptions generated from the input video by a vision-language model supply accurate complementary semantic information without introducing substantial misalignment or hallucination errors.
What would settle it
A controlled ablation that removes the text component from the transport cost and measures whether the reported F1 gains disappear or reverse on the same datasets.
Original abstract
Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TASOT, an annotation-free extension of the Action Segmentation Optimal Transport (ASOT) framework for surgical temporal segmentation. It augments the unbalanced Gromov-Wasserstein objective with temporally aligned captions that a vision-language model generates from the input video; the captions are encoded via CLIP and fused with DINOv3 visual features. The manuscript reports large F1 gains over zero-shot baselines on Cholec80 (+18.9), AutoLaparo (+33.7), StrasByPass70 (+23.7), and BernByPass70 (+4.5).
Significance. If the central claim holds, the work would offer a meaningful advance toward practical, training-free surgical workflow analysis across robotic platforms by leveraging public datasets and avoiding domain-specific pretraining. The multimodal extension of ASOT and the breadth of evaluated benchmarks (laparoscopic and robotic procedures) are positive aspects that would strengthen the contribution if the text modality's reliability is substantiated.
major comments (3)
- [§3] §3 (Method): The description of the unified unbalanced Gromov-Wasserstein objective and the visual-text fusion mechanism lacks the explicit cost-matrix formulation and the precise role of the visual-text fusion weight; without these equations it is impossible to verify that the reported gains arise from complementary semantic structure rather than from an additional tunable hyperparameter or from the base ASOT component alone.
- [§4] §4 (Experiments) and associated result tables: The main quantitative tables report substantial F1 improvements, yet no ablation isolating the text modality, no quantitative metrics on VLM caption accuracy or temporal alignment error relative to ground-truth phases, and no transport-plan comparison with versus without the CLIP component are provided; these omissions directly undermine confidence that the +18.9–33.7 F1 gains are attributable to reliable multimodal fusion in the surgical domain.
- [§4.2] §4.2 or §5 (Discussion): The load-bearing assumption that VLM-generated captions supply domain-appropriate semantic cues without systematic hallucination or phase misalignment is stated but not tested; given that surgical actions and terminology lie far outside typical VLM training distributions, the absence of any error analysis or sensitivity study on caption quality leaves the central claim vulnerable.
minor comments (2)
- [Abstract] Abstract: The claim of evaluation on 'three public surgical datasets and four benchmark settings' is inconsistent with the four datasets explicitly named; clarify the exact dataset count and benchmark definitions.
- [Notation] Notation: Ensure that symbols for the unbalanced Gromov-Wasserstein distance, the fusion weight, and the temporal alignment operator are defined once and used consistently between the method equations and the experimental description.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which highlight important aspects for strengthening the clarity and validation of our TASOT framework. We address each major comment point by point below and will revise the manuscript accordingly to incorporate the requested details and analyses.
Point-by-point responses
- Referee: [§3] §3 (Method): The description of the unified unbalanced Gromov-Wasserstein objective and the visual-text fusion mechanism lacks the explicit cost-matrix formulation and the precise role of the visual-text fusion weight; without these equations it is impossible to verify that the reported gains arise from complementary semantic structure rather than from an additional tunable hyperparameter or from the base ASOT component alone.
Authors: We agree that the method section would benefit from greater mathematical precision. The original submission described the fusion at a conceptual level but did not provide the explicit cost-matrix equations or the role of the fusion weight. In the revised manuscript we will add the full formulation of the unbalanced Gromov-Wasserstein objective, explicitly defining the fused cost matrix as a convex combination C = β C_img + (1-β) C_text (where C_img is computed from DINOv3 frame features and C_text from CLIP embeddings of temporally aligned VLM captions) and stating that β is a fixed hyperparameter chosen via a small validation sweep. We will also include a brief sensitivity plot for β to demonstrate that the reported gains are not an artifact of a single tuned value. revision: yes
- Referee: [§4] §4 (Experiments) and associated result tables: The main quantitative tables report substantial F1 improvements, yet no ablation isolating the text modality, no quantitative metrics on VLM caption accuracy or temporal alignment error relative to ground-truth phases, and no transport-plan comparison with versus without the CLIP component are provided; these omissions directly undermine confidence that the +18.9–33.7 F1 gains are attributable to reliable multimodal fusion in the surgical domain.
Authors: We acknowledge that the current experimental section does not contain the requested ablations and supporting analyses. In the revision we will add (i) a direct ablation of TASOT versus the original ASOT baseline (i.e., with and without the text modality), (ii) quantitative caption-quality metrics (e.g., temporal alignment error measured against ground-truth phase boundaries and semantic similarity scores where reference descriptions exist), and (iii) side-by-side visualizations or quantitative comparisons of the learned transport plans with and without the CLIP component. These additions will allow readers to attribute the F1 gains specifically to the multimodal fusion. revision: yes
- Referee: [§4.2] §4.2 or §5 (Discussion): The load-bearing assumption that VLM-generated captions supply domain-appropriate semantic cues without systematic hallucination or phase misalignment is stated but not tested; given that surgical actions and terminology lie far outside typical VLM training distributions, the absence of any error analysis or sensitivity study on caption quality leaves the central claim vulnerable.
Authors: We recognize the importance of directly validating the reliability of VLM captions in the surgical domain. While the consistent gains across four benchmarks provide indirect support, we agree that explicit error analysis is warranted. In the revised discussion we will add a sensitivity study that varies the VLM (and prompt) and reports observed caption error rates and phase misalignment statistics. We will also discuss the limitations arising from potential hallucinations and how the unbalanced transport formulation partially mitigates them. revision: yes
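The responses above mention a temporal alignment error measured against ground-truth phase boundaries without defining it. One plausible definition, offered here purely as an assumption, is the mean absolute gap between each ground-truth boundary and the nearest caption boundary. The function name `boundary_error` and the timestamps are hypothetical and serve only to illustrate the computation.

```python
import numpy as np


def boundary_error(caption_boundaries, phase_boundaries):
    """Mean absolute gap from each ground-truth boundary to the nearest caption boundary."""
    cap = np.asarray(caption_boundaries, dtype=float)
    gt = np.asarray(phase_boundaries, dtype=float)
    if cap.size == 0 or gt.size == 0:
        raise ValueError("both boundary lists must be non-empty")
    # Pairwise |gt - caption| gaps, then the closest caption boundary per gt boundary.
    gaps = np.abs(gt[:, None] - cap[None, :]).min(axis=1)
    return float(gaps.mean())


# Toy timestamps in seconds (illustrative, not taken from any dataset).
caption_boundaries = [0.0, 58.0, 130.0, 305.0]
phase_boundaries = [60.0, 300.0]
print(boundary_error(caption_boundaries, phase_boundaries))
```

This measure is one-sided: it penalises ground-truth boundaries that no caption boundary lands near, but it does not penalise extra caption boundaries. A symmetric variant would average it with the same call made with the arguments swapped.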
Circularity Check
No circularity: extension of prior ASOT uses external pre-trained models and reports results on public benchmarks
The paper defines TASOT by extending the existing ASOT formulation with multimodal inputs from DINOv3 and CLIP (pre-trained external models) inside an unbalanced Gromov-Wasserstein objective, then evaluates the resulting method on independent public datasets (Cholec80, AutoLaparo, StrasByPass70, BernByPass70) against zero-shot baselines. No equations or steps reduce by construction to fitted parameters, self-definitions, or unverified self-citations; the performance deltas are measured outcomes rather than tautological outputs. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- visual-text fusion weight
axioms (1)
- [domain assumption] Temporally aligned captions generated by a vision-language model supply reliable complementary semantic structure for surgical phase matching.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions ... within a unified unbalanced Gromov-Wasserstein optimal transport objective. ... C_{i,k} = β C_img_{i,k} + (1-β) C_text_{i,k}"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We use modest off-the-shelf encoders ... no task-specific supervised pretraining and no massive backbone architectures tailored to surgical data."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.