Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models
Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3
The pith
Vision foundation model features outperform supervised encoders for phase segmentation in small-incision cataract surgery under fixed temporal modeling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the controlled setup that fixes the temporal model and training protocol, features extracted from large self-supervised vision foundation models improve segmentation performance over those from supervised encoders on SICS-155 videos. DINOv3 ViT-7B delivers the strongest results at 83.4 percent accuracy and 87.0 edit score, while the study further shows that domain-specific adaptation from unlabeled videos produces mixed effects depending on the base encoder.
What carries the argument
The cached-feature pipeline that extracts visual representations once and then trains only the lightweight MS-TCN++ temporal model on those fixed features, allowing direct isolation of encoder quality.
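A minimal sketch of the cached-feature idea, assuming a per-video on-disk cache. The names `toy_encoder` and `cached_features`, the pickle file layout, and the toy frames are hypothetical stand-ins; the paper's actual encoders, storage format, and MS-TCN++ training code are not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def toy_encoder(frame):
    """Hypothetical stand-in for a frozen visual encoder: frame -> feature vector."""
    toy_encoder.calls += 1
    mean = sum(frame) / len(frame)
    return [mean, max(frame) - min(frame)]


toy_encoder.calls = 0  # counts how often the (expensive) encoder is invoked


def cached_features(video_id, frames, cache_dir):
    """Return per-frame features, running the encoder only on a cache miss."""
    path = Path(cache_dir) / f"{video_id}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    features = [toy_encoder(frame) for frame in frames]
    with path.open("wb") as fh:
        pickle.dump(features, fh)
    return features


# Toy "video": four frames of three pixel values each (illustrative data only).
frames = [[0, 1, 2], [2, 2, 2], [5, 1, 0], [3, 3, 9]]

with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_features("video_001", frames, cache_dir)
    calls_after_first = toy_encoder.calls
    # A later pass (another epoch, another temporal-model run) reads the cache
    # and never touches the encoder again.
    second = cached_features("video_001", frames, cache_dir)
    calls_after_second = toy_encoder.calls
```

Because the encoder runs once per video, every temporal-model run sees exactly the same inputs, which is what makes the encoder the only varying factor in such a comparison.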
If this is right
- Foundation-model features raise both frame-wise accuracy and edit-score metrics in low-label surgical video settings.
- DINOv3 ViT-7B features give the largest gains when paired with the fixed temporal model.
- Lightweight adaptation on unlabeled videos can improve or degrade results depending on domain match with the base encoder.
- The approach supplies concrete guidance for building phase segmentation systems when only limited annotated surgical videos exist.
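The two metrics named above can be sketched as follows. This assumes the edit score is the segment-level normalized Levenshtein score commonly used in action segmentation; the paper's exact evaluation code is not reproduced here, and the phase names and label sequences are illustrative placeholders rather than SICS-155 annotations.

```python
def frame_accuracy(pred, gold):
    """Percentage of frames whose predicted phase matches the annotation."""
    correct = sum(p == g for p, g in zip(pred, gold))
    return 100.0 * correct / len(gold)


def segments(labels):
    """Collapse a per-frame label sequence into its ordered list of segments."""
    out = []
    for label in labels:
        if not out or out[-1] != label:
            out.append(label)
    return out


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_score(pred, gold):
    """Segment-level score in [0, 100]; penalizes over-segmentation and wrong order."""
    p, g = segments(pred), segments(gold)
    return 100.0 * (1.0 - levenshtein(p, g) / max(len(p), len(g)))


# Placeholder phase names and sequences, for illustration only.
gold = ["incision"] * 4 + ["capsulotomy"] * 4 + ["lens_removal"] * 4
pred = (["incision"] * 3 + ["capsulotomy"] * 2 + ["incision"]
        + ["capsulotomy"] * 2 + ["lens_removal"] * 4)

acc = frame_accuracy(pred, gold)
edit = edit_score(pred, gold)
```

In this toy case only two of twelve frames are wrong, yet the single spurious "incision" frame splits one segment into three, so the edit score drops much further than the accuracy. That is why the two metrics are reported together.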
Where Pith is reading between the lines
- The same controlled caching method could be applied to phase segmentation in other surgical domains to test transfer of these foundation models.
- Replacing the temporal head with a more expressive architecture might widen or narrow the gap between foundation and supervised features.
- Inference speed gains from caching could support real-time workflow monitoring if the visual encoder is also quantized or distilled.
Load-bearing premise
That holding the temporal model and training settings identical across encoders fully isolates representation quality without hidden interactions between feature statistics and the temporal head.
What would settle it
Re-running the comparison with the same encoders and cached features but a different temporal architecture, such as a transformer, and checking whether the performance ordering among encoders stays the same.
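One way to quantify whether the ordering "stays the same" is a rank correlation between the encoder scores obtained under two temporal heads. The sketch below uses a simple Kendall-style statistic and assumes no tied scores; the encoder names and numbers are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Rank agreement between two score dicts over the same encoders.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    Assumes no tied scores.
    """
    concordant = discordant = 0
    for x, y in combinations(sorted(scores_a), 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical accuracies under two temporal heads (placeholder values only).
head_1 = {"enc_a": 70.0, "enc_b": 75.0, "enc_c": 80.0, "enc_d": 83.0}
head_2 = {"enc_a": 72.0, "enc_b": 71.0, "enc_c": 79.0, "enc_d": 85.0}

tau = kendall_tau(head_1, head_2)
```

A value near 1.0 would suggest the encoder ranking does not depend much on the temporal head; a low or negative value would point to the interaction effects the load-bearing premise rules out.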
Original abstract
Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient-sics-phase-seg/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a controlled empirical comparison of visual encoders for data-efficient surgical phase segmentation in small-incision cataract surgery (SICS) on the SICS-155 dataset (19 phases). Each encoder is paired with an identical MS-TCN++ temporal model under fixed training settings and a cached-feature pipeline; supervised baselines (ResNet-50, I3D) are compared against self-supervised foundation models (DINOv3, V-JEPA2). The central result is that foundation-model features improve performance, with DINOv3 ViT-7B achieving the highest scores (83.4% accuracy, 87.0 edit score), alongside analysis of cataract-domain transfer from unlabeled videos.
Significance. If the isolation of representation quality holds, the work provides concrete evidence that large vision foundation models transfer effectively to surgical workflow analysis in low-label medical video settings. The controlled protocol, cached-feature efficiency, and concrete metrics (accuracy and edit score) offer a useful benchmark and practical guidance for computer-assisted surgery, where annotation is expensive.
major comments (1)
- [Abstract and Methods (controlled comparison)] The claim that the setup 'isolates representation quality' rests on pairing every encoder with identical MS-TCN++ weights, hyperparameters, and cached features. However, the manuscript does not describe any feature normalization, dimensionality projection, or per-encoder adaptation to address mismatches in output dimension and channel statistics (e.g., ResNet-50 vs. ViT-7B). Because MS-TCN++ uses dilated convolutions sensitive to input statistics, performance differences could arise from better statistical alignment with the fixed temporal head rather than intrinsic representation power; this is load-bearing for attributing gains specifically to foundation models.
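One form the feature normalization raised in this comment could take is per-channel standardization fitted on training videos only. This is a sketch of that idea under assumed names (`fit_channel_stats`, `standardize`) and toy data; it is not a step the manuscript describes.

```python
from statistics import fmean, pstdev


def fit_channel_stats(train_videos, eps=1e-6):
    """Per-channel mean and std over all frames of all training videos.

    train_videos: list of videos, each a list of frames, each a list of C floats.
    """
    channels = list(zip(*(frame for video in train_videos for frame in video)))
    means = [fmean(ch) for ch in channels]
    stds = [max(pstdev(ch), eps) for ch in channels]  # eps guards constant channels
    return means, stds


def standardize(video, means, stds):
    """Apply training-set statistics to one video (train or test)."""
    return [[(v - m) / s for v, m, s in zip(frame, means, stds)] for frame in video]


# Toy features with two channels on very different scales (illustrative only).
train = [[[0.0, 100.0], [2.0, 300.0]], [[4.0, 200.0]]]
means, stds = fit_channel_stats(train)
normalized = [standardize(video, means, stds) for video in train]
```

Fitting the statistics on training videos and reusing them at test time keeps the procedure free of test-set leakage while giving every encoder's features zero mean and unit variance per channel before they reach the temporal head.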
minor comments (2)
- [Abstract and Results] The abstract and results sections report concrete numbers but omit details on train/test splits, number of videos per split, or cross-validation procedure for SICS-155, which are required to evaluate robustness.
- [Results] No statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) are reported for the accuracy and edit-score differences across encoders.
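A paired bootstrap over test videos is one way to obtain the confidence intervals mentioned in the last comment. The sketch below assumes per-video scores are available for both encoders on the same test videos; the function name, resample count, and score values are hypothetical placeholders, not numbers from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for mean(scores_a - scores_b).

    Videos are resampled with replacement; pairing is preserved because the
    per-video differences are resampled, not the two score lists separately.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-video accuracies for two encoders on the same six test videos.
encoder_a = [82.0, 85.5, 79.0, 88.0, 84.0, 81.5]
encoder_b = [78.0, 83.0, 77.5, 84.0, 80.0, 80.0]

lower, upper = paired_bootstrap_ci(encoder_a, encoder_b)
```

If the resulting interval excludes zero, the difference between the two encoders is unlikely to be an artifact of which videos landed in the test split.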
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our controlled empirical study. We address the major comment point-by-point below and will revise the manuscript to improve transparency on the experimental pipeline.
Referee: The claim that the setup 'isolates representation quality' rests on pairing every encoder with identical MS-TCN++ weights, hyperparameters, and cached features. However, the manuscript does not describe any feature normalization, dimensionality projection, or per-encoder adaptation to address mismatches in output dimension and channel statistics (e.g., ResNet-50 vs. ViT-7B). Because MS-TCN++ uses dilated convolutions sensitive to input statistics, performance differences could arise from better statistical alignment with the fixed temporal head rather than intrinsic representation power; this is load-bearing for attributing gains specifically to foundation models.
Authors: We agree that the manuscript should have explicitly described the feature handling steps to support the isolation claim. In the implemented pipeline, raw encoder outputs were cached and passed directly to MS-TCN++ without per-encoder normalization, projection layers, or adaptation; the temporal model’s first convolutional layer accommodates varying input channel dimensions. We acknowledge that this leaves open the possibility that some performance differences arise from statistical alignment rather than representation quality alone. We will revise the Methods section to document the exact output dimensions of each encoder, confirm the absence of additional preprocessing, and temper the phrasing around 'isolating representation quality' to reflect the fixed-head controlled setup more precisely. If space permits, we will also note that a common-dimension projection ablation could be explored in follow-up work. These changes will make the attribution of gains more defensible without altering the core empirical findings.
Revision: yes
Circularity Check
No circularity: empirical comparison of external pre-trained encoders
The paper conducts a controlled empirical study by extracting features from various pre-trained visual encoders (ResNet-50, I3D, DINOv3, V-JEPA2) and feeding them into a fixed MS-TCN++ temporal model under identical training settings on the SICS-155 dataset. No mathematical derivations, predictions, or first-principles results are claimed; performance metrics (e.g., 83.4% accuracy for DINOv3 ViT-7B) arise directly from experimental evaluation of externally sourced models. The setup uses cached features to decouple encoding from temporal learning, with no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The analysis is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Pre-trained foundation model features transfer meaningfully to surgical video without domain-specific fine-tuning of the encoder itself.
- domain assumption: The SICS-155 dataset with 19 phases is a fair proxy for real-world small-incision cataract surgery variability.