Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models

Chen Chen; Lincoln Spencer; Song Wang

arxiv: 2604.10514 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

keywords surgical phase segmentation · vision foundation models · data-efficient learning · cataract surgery · self-supervised learning · video analysis · computer assisted surgery · temporal modeling

The pith

Vision foundation model features outperform supervised encoders for phase segmentation in small-incision cataract surgery under fixed temporal modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines data-efficient surgical phase segmentation for manual small-incision cataract surgery, where labeled videos are scarce. It runs a controlled comparison by pairing each visual encoder with the identical MS-TCN++ temporal model and identical training settings on the SICS-155 dataset of 19 phases. Self-supervised foundation models supply stronger features than supervised baselines such as ResNet-50 or I3D, with DINOv3 ViT-7B reaching the highest accuracy and edit score. A cached-feature pipeline keeps visual encoding separate from temporal learning to make the comparison direct. The results also track how lightweight adaptation on unlabeled cataract videos affects outcomes and when it helps or hurts.

Core claim

Under the controlled setup that fixes the temporal model and training protocol, features extracted from large self-supervised vision foundation models improve segmentation performance over those from supervised encoders on SICS-155 videos. DINOv3 ViT-7B delivers the strongest results at 83.4 percent accuracy and 87.0 edit score, while the study further shows that domain-specific adaptation from unlabeled videos produces mixed effects depending on the base encoder.
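The two reported metrics can be sketched as follows. This is a minimal illustration that assumes the usual action-segmentation definitions: frame-wise accuracy as the percentage of correctly labeled frames, and edit score as a normalized Levenshtein distance over the sequence of predicted segments. The function names are hypothetical, and the paper's exact evaluation code may differ.

```python
from itertools import groupby


def frame_accuracy(pred, gold):
    """Percentage of frames whose predicted phase equals the annotated phase."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same number of frames")
    correct = sum(p == g for p, g in zip(pred, gold))
    return 100.0 * correct / len(gold)


def to_segments(labels):
    """Collapse a frame-level label sequence into its ordered segment labels."""
    return [label for label, _ in groupby(labels)]


def levenshtein(a, b):
    """Edit distance between two label sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_score(pred, gold):
    """Segment-level score in [0, 100]; higher means fewer ordering errors."""
    p, g = to_segments(pred), to_segments(gold)
    return 100.0 * (1.0 - levenshtein(p, g) / max(len(p), len(g)))
```

Under these definitions the edit score penalizes over-segmentation (spurious phase switches) even when most frames are labeled correctly, which is why the two numbers are reported side by side.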

What carries the argument

The cached-feature pipeline that extracts visual representations once and then trains only the lightweight MS-TCN++ temporal model on those fixed features, allowing direct isolation of encoder quality.
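A rough sketch of the caching idea, under stated assumptions: `toy_encoder` is a hypothetical stand-in for a frozen visual encoder, and features are stored as JSON files only to keep the example dependency-free. The paper's actual storage format and extraction code are not described here.

```python
import json
import tempfile
from pathlib import Path


def toy_encoder(frame):
    """Hypothetical stand-in for a frozen encoder: frame -> small feature vector."""
    return [sum(frame) / len(frame), float(max(frame))]


def extract_or_load(video_id, frames, encoder, cache_dir):
    """Return per-frame features, running the encoder only on a cache miss."""
    path = Path(cache_dir) / f"{video_id}.json"
    if path.exists():
        # Cache hit: the expensive visual encoder is skipped entirely.
        return json.loads(path.read_text())
    features = [encoder(frame) for frame in frames]
    path.write_text(json.dumps(features))
    return features


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        frames = [[0, 2, 4], [1, 1, 1]]  # placeholder "frames", not real video data
        feats = extract_or_load("video_000", frames, toy_encoder, cache_dir)
        # A second call reads from disk; only the temporal model would train on feats.
        assert extract_or_load("video_000", frames, toy_encoder, cache_dir) == feats
```

Because the temporal model only ever sees the cached arrays, swapping encoders amounts to pointing the same training script at a different cache directory.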

If this is right

Foundation-model features raise both frame-wise accuracy and edit-score metrics in low-label surgical video settings.
DINOv3 ViT-7B features give the largest gains when paired with the fixed temporal model.
Lightweight adaptation on unlabeled videos can improve or degrade results depending on domain match with the base encoder.
The approach supplies concrete guidance for building phase segmentation systems when only limited annotated surgical videos exist.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same controlled caching method could be applied to phase segmentation in other surgical domains to test transfer of these foundation models.
Replacing the temporal head with a more expressive architecture might widen or narrow the gap between foundation and supervised features.
Inference speed gains from caching could support real-time workflow monitoring if the visual encoder is also quantized or distilled.

Load-bearing premise

That holding the temporal model and training settings identical across encoders fully isolates representation quality without hidden interactions between feature statistics and the temporal head.

What would settle it

Retraining the identical encoders with a different temporal architecture such as a transformer and checking whether the performance ordering among encoders stays the same.
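One way to quantify whether the ordering "stays the same" is a rank correlation between encoder scores under the two temporal heads. The sketch below computes Kendall's tau (the tau-a variant, with ties counted as neither concordant nor discordant). The encoder names in the usage note are placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between two {encoder_name: score} mappings.

    Returns 1.0 when both temporal heads rank the encoders identically
    and -1.0 when the ranking is fully reversed.
    """
    names = sorted(scores_x)
    if names != sorted(scores_y):
        raise ValueError("both mappings must cover the same encoders")
    if len(names) < 2:
        raise ValueError("need at least two encoders to compare rankings")
    concordant = discordant = 0
    for a, b in combinations(names, 2):
        sign = (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / total_pairs
```

For example, passing two dictionaries keyed by hypothetical names such as `enc_a`, `enc_b`, `enc_c`, one per temporal head, yields a single number summarizing how stable the encoder ranking is across heads.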

Figures

Figures reproduced from arXiv: 2604.10514 by Chen Chen, Lincoln Spencer, Song Wang.

**Figure 1.** Controlled study pipeline. Encoders are frozen after optional cataract-domain SSL continuation and LoRA adaptation, features are …

**Figure 2.** SICS-155 phase distribution at frame level. The long-tail imbalance motivates representation learning approaches that remain …

**Figure 3.** Qualitative phase segmentation results for BL …

**Figure 4.** Per-phase change from CataractFT + LoRA relative to SSL-only continuation from the same cataract-domain checkpoint (no …

Original abstract

Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient-sics-phase-seg/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This controlled encoder-swap study shows DINOv3 features outperforming standard models on a small SICS dataset with a cached MS-TCN++ pipeline, but the fixed temporal head leaves open whether gains truly isolate representation quality.

The main result is that large self-supervised encoders, led by DINOv3 ViT-7B, reach 83.4% accuracy and 87.0 edit score on phase segmentation for small-incision cataract surgery when paired with the same MS-TCN++ temporal model. The setup uses cached features on the SICS-155 set to keep training lightweight, and it also tests simple domain adaptation from extra unlabeled cataract videos. This gives a clear empirical picture of how public foundation models transfer to a data-scarce surgical video task without retraining everything from scratch.

The design is practical and the numbers are reported directly, which is helpful for anyone working in low-label medical video settings. The paper sticks to what the experiments show and avoids broad claims about new architectures. It is the kind of work that fills a gap between general vision foundation models and specialized surgical applications.

One soft spot is the isolation of representation quality. The encoders differ in output dimension and feature statistics, yet the abstract gives no sign of normalization, projection layers, or per-encoder tuning for the MS-TCN++ head. Without those steps, some of the observed gap could trace to better statistical fit with the dilated convolutions rather than intrinsic representation strength. The full methods would need to confirm whether this was addressed.

This paper is for readers focused on medical video analysis or practical transfer of foundation models. It is not a theoretical contribution but supplies concrete benchmarks and guidance on when adaptation helps. I would send it to peer review. The controlled comparison and real dataset make it worth referee time, even if revisions are needed on the feature handling details.

Referee Report

1 major / 2 minor

Summary. The paper presents a controlled empirical comparison of visual encoders for data-efficient surgical phase segmentation in small-incision cataract surgery (SICS) on the SICS-155 dataset (19 phases). Each encoder is paired with an identical MS-TCN++ temporal model under fixed training settings and a cached-feature pipeline; supervised baselines (ResNet-50, I3D) are compared against self-supervised foundation models (DINOv3, V-JEPA2). The central result is that foundation-model features improve performance, with DINOv3 ViT-7B achieving the highest scores (83.4% accuracy, 87.0 edit score), alongside analysis of cataract-domain transfer from unlabeled videos.

Significance. If the isolation of representation quality holds, the work provides concrete evidence that large vision foundation models transfer effectively to surgical workflow analysis in low-label medical video settings. The controlled protocol, cached-feature efficiency, and concrete metrics (accuracy and edit score) offer a useful benchmark and practical guidance for computer-assisted surgery, where annotation is expensive.

major comments (1)

[Abstract and Methods (controlled comparison)] The claim that the setup 'isolates representation quality' rests on pairing every encoder with identical MS-TCN++ weights, hyperparameters, and cached features. However, the manuscript does not describe any feature normalization, dimensionality projection, or per-encoder adaptation to address mismatches in output dimension and channel statistics (e.g., ResNet-50 vs. ViT-7B). Because MS-TCN++ uses dilated convolutions sensitive to input statistics, performance differences could arise from better statistical alignment with the fixed temporal head rather than intrinsic representation power; this is load-bearing for attributing gains specifically to foundation models.
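The kind of feature normalization this comment refers to could look like the sketch below: per-channel standardization with statistics fitted on training features only, applied identically to every encoder's cached output. This is an illustrative assumption about what such a step might be, not something the manuscript describes; the function names and the `eps` floor are hypothetical.

```python
from statistics import fmean, pstdev


def fit_channel_stats(train_features, eps=1e-6):
    """Per-channel mean and std from training frames (each a list of D floats).

    The std is floored at eps so constant channels do not cause division by zero.
    """
    columns = list(zip(*train_features))
    means = [fmean(col) for col in columns]
    stds = [max(pstdev(col), eps) for col in columns]
    return means, stds


def standardize(features, means, stds):
    """Apply the fitted statistics to any split (train, validation, or test)."""
    return [
        [(x - m) / s for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]
```

Fitting the statistics on the training split and reusing them for evaluation keeps the step free of test-set leakage while giving the temporal head inputs on a comparable scale across encoders.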

minor comments (2)

[Abstract and Results] The abstract and results sections report concrete numbers but omit details on train/test splits, number of videos per split, or cross-validation procedure for SICS-155, which are required to evaluate robustness.
[Results] No statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) are reported for the accuracy and edit-score differences across encoders.
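A paired bootstrap over test videos is one way to produce the confidence intervals mentioned in the second comment. The sketch below assumes per-video scores are available for two encoders evaluated on the same videos; the function name, resample count, and seed are illustrative choices rather than anything taken from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-video difference (A minus B).

    scores_a and scores_b must be aligned so that index i refers to the same video.
    Returns (observed_mean_difference, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample videos with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    low = resampled_means[int((alpha / 2) * n_resamples)]
    high = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, low, high
```

If the resulting interval excludes zero, the difference between the two encoders is unlikely to be an artifact of which videos landed in the test split. With a small test set the interval will be wide, which is itself useful to report.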

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our controlled empirical study. We address the major comment point-by-point below and will revise the manuscript to improve transparency on the experimental pipeline.

Referee: The claim that the setup 'isolates representation quality' rests on pairing every encoder with identical MS-TCN++ weights, hyperparameters, and cached features. However, the manuscript does not describe any feature normalization, dimensionality projection, or per-encoder adaptation to address mismatches in output dimension and channel statistics (e.g., ResNet-50 vs. ViT-7B). Because MS-TCN++ uses dilated convolutions sensitive to input statistics, performance differences could arise from better statistical alignment with the fixed temporal head rather than intrinsic representation power; this is load-bearing for attributing gains specifically to foundation models.

Authors: We agree that the manuscript should have explicitly described the feature handling steps to support the isolation claim. In the implemented pipeline, raw encoder outputs were cached and passed directly to MS-TCN++ without per-encoder normalization, projection layers, or adaptation; the temporal model’s first convolutional layer accommodates varying input channel dimensions. We acknowledge that this leaves open the possibility that some performance differences arise from statistical alignment rather than representation quality alone. We will revise the Methods section to document the exact output dimensions of each encoder, confirm the absence of additional preprocessing, and temper the phrasing around 'isolating representation quality' to reflect the fixed-head controlled setup more precisely. If space permits, we will also note that a common-dimension projection ablation could be explored in follow-up work. These changes will make the attribution of gains more defensible without altering the core empirical findings.

revision: yes
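For illustration, a common-dimension projection ablation of the kind mentioned in the response might start from something as simple as a fixed random linear map per encoder, so that every temporal head sees inputs of the same width. The sketch below is an assumption about one possible design (a seeded Gaussian random projection); a learned linear layer or PCA would be equally plausible, and none of these names come from the paper.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Fixed Gaussian random projection matrix with shape (out_dim, in_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / out_dim ** 0.5  # keeps projected norms roughly comparable
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def project(features, matrix):
    """Map each frame feature (length in_dim) to the common width out_dim."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in matrix]
        for frame in features
    ]
```

Applying `make_projection(d_encoder, d_common)` once per encoder and caching the projected features would let the ablation reuse the existing pipeline unchanged, with only the cache contents differing between runs.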

Circularity Check

0 steps flagged

No circularity: empirical comparison of external pre-trained encoders

The paper conducts a controlled empirical study by extracting features from various pre-trained visual encoders (ResNet-50, I3D, DINOv3, V-JEPA2) and feeding them into a fixed MS-TCN++ temporal model under identical training settings on the SICS-155 dataset. No mathematical derivations, predictions, or first-principles results are claimed; performance metrics (e.g., 83.4% accuracy for DINOv3 ViT-7B) arise directly from experimental evaluation of externally sourced models. The setup uses cached features to decouple encoding from temporal learning, with no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The analysis is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of transfer learning and the representativeness of the SICS-155 dataset; no new entities or ad-hoc parameters are introduced beyond the choice of MS-TCN++ architecture.

axioms (2)

Domain assumption: Pre-trained foundation model features transfer meaningfully to surgical video without domain-specific fine-tuning of the encoder itself.
Invoked when the authors freeze encoders and only train the temporal head.
Domain assumption: The SICS-155 dataset with 19 phases is a fair proxy for real-world small-incision cataract surgery variability.
Used to generalize the reported accuracy and edit scores.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] T. M. Ward, D. M. Fer, Y. Ban, G. Rosman, O. R. Meireles, and D. A. Hashimoto, “Challenges in surgical video annotation,” Computer Assisted Surgery, vol. 26, no. 1, pp. 58–68,

[2] G. Tabin, M. Michael Chen, and L. Espandar, “Cataract surgery for the developing world,” Current Opinion in Ophthalmology, vol. 19, no. 1, pp. 55–59, 2008.

[3] R. Venkatesh, D. F. Chang, R. Muralikrishnan, K. Hemal, P. Gogate, and S. Sengupta, “Manual small incision cataract surgery: A review,” Asia-Pacific Journal of Ophthalmology, vol. 1, no. 2, pp. 113–119, 2012.

[4] S. Li, Y. A. Farha, Y. Liu, M.-M. Cheng, and J. Gall, “MS-TCN++: Multi-stage temporal convolutional network for action segmentation,” 2020.

[5] S. Mueller, “SICS-155: Phase recognition in small incision cataract surgery videos,” 2025. International Conference on Medical Image Computing and Computer-Assisted Intervention 2025 (MICCAI 2025), Daejeon, Republic of Korea.

[6] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” 2021.

[7] N. Ghamsarian, Y. El-Shabrawi, S. Nasirihaghighi, D. Putzgruber-Adamitsch, M. Zinkernagel, S. Wolf, K. Schoeffmann, and R. Sznitman, “Cataract-1k dataset for deep-learning-assisted analysis of cataract surgery videos,” Scientific Data, vol. 11, no. 1, p. 373, 2024.

[8] K. Schoeffmann, M. Taschwer, S. Sarny, B. Münzer, M. J. Primus, and D. Putzgruber, “Cataract-101: Video dataset of 101 cataract surgeries,” in Proceedings of the 9th ACM Multimedia Systems Conference, pp. 421–425, 2018.

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.

[10] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” 2018.

[11] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski, “DINOv3,” 2025.

[12] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V. Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y. Li, X. Ma, S. Chandar, F. Meier, Y. LeCun, M. Rabbat, and N. Ballas, “V-JEPA 2: Self-supervised video models enable understanding, prediction and planning,” 2025.

[13] S. Mueller, B. Sachdeva, S. N. Prasad, R. Lechtenboehmer, F. G. Holz, R. P. Finger, K. Murali, M. Jain, M. W. M. Wintergerst, and T. Schultz, “Phase recognition in manual small-incision cataract surgery with MS-TCN++ on the novel SICS-105 dataset,” Scientific Reports, vol. 15, no. 1, 2025.
