arxiv: 2604.10514 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords surgical phase segmentation · vision foundation models · data-efficient learning · cataract surgery · self-supervised learning · video analysis · computer assisted surgery · temporal modeling

The pith

Vision foundation model features outperform supervised encoders for phase segmentation in small-incision cataract surgery under fixed temporal modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines data-efficient surgical phase segmentation for manual small-incision cataract surgery, where labeled videos are scarce. It runs a controlled comparison by pairing each visual encoder with the identical MS-TCN++ temporal model and identical training settings on the SICS-155 dataset of 19 phases. Self-supervised foundation models supply stronger features than supervised baselines such as ResNet-50 or I3D, with DINOv3 ViT-7B reaching the highest accuracy and edit score. A cached-feature pipeline keeps visual encoding separate from temporal learning to make the comparison direct. The results also track how lightweight adaptation on unlabeled cataract videos affects outcomes and when it helps or hurts.

Core claim

Under the controlled setup that fixes the temporal model and training protocol, features extracted from large self-supervised vision foundation models improve segmentation performance over those from supervised encoders on SICS-155 videos. DINOv3 ViT-7B delivers the strongest results at 83.4 percent accuracy and 87.0 edit score, while the study further shows that domain-specific adaptation from unlabeled videos produces mixed effects depending on the base encoder.
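The two reported metrics can be made concrete with a small example. The sketch below assumes the definitions commonly used in action segmentation: frame-wise accuracy as the percentage of correctly labeled frames, and the segmental edit score as 100 × (1 − normalized Levenshtein distance) between the segment sequences obtained by collapsing repeated frame labels. The paper's own evaluation code is not reproduced here; the function names and the toy labels are illustrative assumptions.

```python
from itertools import groupby


def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted phase matches the ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)


def to_segments(labels):
    """Collapse a per-frame label sequence into its ordered list of segments."""
    return [label for label, _ in groupby(labels)]


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_score(pred, gt):
    """Segmental edit score in [0, 100]; higher is better."""
    p, g = to_segments(pred), to_segments(gt)
    return 100.0 * (1.0 - levenshtein(p, g) / max(len(p), len(g)))


# Toy phase ids per frame for one short video (made-up, not from SICS-155).
gt = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 0, 2, 2, 2, 2]
acc = frame_accuracy(pred, gt)   # 8 of 10 frames correct
edit = edit_score(pred, gt)      # segments [0, 1, 0, 2] vs [0, 1, 2]
```

In this toy case the prediction gets 80% of frames right but inserts one spurious segment, so the edit score is 75. The two metrics diverge in exactly this way when a model is mostly correct per frame yet over-segments.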

What carries the argument

A cached-feature pipeline extracts visual representations once and then trains only the lightweight MS-TCN++ temporal model on those fixed features, so that differences between runs trace back to the encoder rather than to the training procedure.
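The caching idea is simple enough to sketch. The example below is a minimal illustration, not the authors' implementation: `CountingEncoder` is a hypothetical stand-in for a frozen visual encoder, the cache key format and the `cached_features` helper are assumptions, and features are stored with Python's `pickle` purely for brevity.

```python
import pickle
import tempfile
from pathlib import Path


class CountingEncoder:
    """Hypothetical stand-in for a frozen visual encoder (not a real model)."""

    def __init__(self, name, dim):
        self.name = name
        self.dim = dim
        self.calls = 0

    def encode(self, frames):
        # Toy features: one `dim`-length vector per frame.
        self.calls += 1
        return [[float(f % 7)] * self.dim for f in frames]


def cached_features(encoder, video_id, frames, cache_dir):
    """Return per-frame features, running the encoder only on a cache miss."""
    path = Path(cache_dir) / f"{encoder.name}__{video_id}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    feats = encoder.encode(frames)
    with path.open("wb") as fh:
        pickle.dump(feats, fh)
    return feats


with tempfile.TemporaryDirectory() as cache_dir:
    enc = CountingEncoder("toy_encoder", dim=4)
    frames = list(range(10))  # placeholder for decoded video frames
    first = cached_features(enc, "video_001", frames, cache_dir)
    second = cached_features(enc, "video_001", frames, cache_dir)  # cache hit
    encoder_calls = enc.calls
```

The expensive encoder runs once per video, and every later temporal-model run reads the same stored features. That is what lets many training runs of the temporal head share a single pass of a very large encoder.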

If this is right

  • Foundation-model features raise both frame-wise accuracy and edit-score metrics in low-label surgical video settings.
  • DINOv3 ViT-7B features give the largest gains when paired with the fixed temporal model.
  • Lightweight adaptation on unlabeled videos can improve or degrade results depending on domain match with the base encoder.
  • The approach supplies concrete guidance for building phase segmentation systems when only limited annotated surgical videos exist.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same controlled caching method could be applied to phase segmentation in other surgical domains to test transfer of these foundation models.
  • Replacing the temporal head with a more expressive architecture might widen or narrow the gap between foundation and supervised features.
  • Inference speed gains from caching could support real-time workflow monitoring if the visual encoder is also quantized or distilled.

Load-bearing premise

That holding the temporal model and training settings identical across encoders fully isolates representation quality without hidden interactions between feature statistics and the temporal head.

What would settle it

Training a different temporal architecture, such as a transformer, on features from the same frozen encoders and checking whether the performance ordering among encoders stays the same.
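One way to quantify "stays the same" is a rank correlation between the encoder orderings under the two temporal heads. The sketch below uses Kendall's tau (the tau-a variant, with no tie correction); the encoder names and scores are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by encoder name."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both dicts must cover the same encoders")
    names = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder scores for four hypothetical encoders under two temporal heads.
head_1 = {"enc_a": 0.60, "enc_b": 0.70, "enc_c": 0.80, "enc_d": 0.75}
head_2 = {"enc_a": 0.62, "enc_b": 0.74, "enc_c": 0.79, "enc_d": 0.81}
tau = kendall_tau(head_1, head_2)  # one of six pairs swaps order
```

A tau near 1 would indicate that the encoder ranking is robust to the choice of temporal head; a low or negative value would suggest the ranking depends on the head.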

Figures

Figures reproduced from arXiv: 2604.10514 by Chen Chen, Lincoln Spencer, Song Wang.

Figure 1: Controlled study pipeline. Encoders are frozen after optional cataract-domain SSL continuation and LoRA adaptation, features are […]
Figure 2: SICS-155 phase distribution at frame level. The long-tail imbalance motivates representation learning approaches that remain […]
Figure 3: Qualitative phase segmentation results for BL […]
Figure 4: Per-phase change from CataractFT + LoRA relative to SSL-only continuation from the same cataract-domain checkpoint (no […]
Original abstract

Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient-sics-phase-seg/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a controlled empirical comparison of visual encoders for data-efficient surgical phase segmentation in small-incision cataract surgery (SICS) on the SICS-155 dataset (19 phases). Each encoder is paired with an identical MS-TCN++ temporal model under fixed training settings and a cached-feature pipeline; supervised baselines (ResNet-50, I3D) are compared against self-supervised foundation models (DINOv3, V-JEPA2). The central result is that foundation-model features improve performance, with DINOv3 ViT-7B achieving the highest scores (83.4% accuracy, 87.0 edit score), alongside analysis of cataract-domain transfer from unlabeled videos.

Significance. If the isolation of representation quality holds, the work provides concrete evidence that large vision foundation models transfer effectively to surgical workflow analysis in low-label medical video settings. The controlled protocol, cached-feature efficiency, and concrete metrics (accuracy and edit score) offer a useful benchmark and practical guidance for computer-assisted surgery, where annotation is expensive.

major comments (1)
  1. [Abstract and Methods (controlled comparison)] The claim that the setup 'isolates representation quality' rests on pairing every encoder with identical MS-TCN++ weights, hyperparameters, and cached features. However, the manuscript does not describe any feature normalization, dimensionality projection, or per-encoder adaptation to address mismatches in output dimension and channel statistics (e.g., ResNet-50 vs. ViT-7B). Because MS-TCN++ uses dilated convolutions sensitive to input statistics, performance differences could arise from better statistical alignment with the fixed temporal head rather than intrinsic representation power; this is load-bearing for attributing gains specifically to foundation models.
minor comments (2)
  1. [Abstract and Results] The abstract and results sections report concrete numbers but omit details on train/test splits, number of videos per split, or cross-validation procedure for SICS-155, which are required to evaluate robustness.
  2. [Results] No statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) are reported for the accuracy and edit-score differences across encoders.
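The second minor comment asks for uncertainty estimates. A minimal sketch of one option follows: a paired percentile bootstrap over videos for the mean difference between two encoders. The per-video scores are made-up placeholders, and the function name, resample count, and seed are assumptions rather than anything specified in the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-video difference (a minus b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per video")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample videos with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return means[lo_idx], means[hi_idx]


# Placeholder per-video accuracies for two hypothetical encoders.
enc_a = [0.81, 0.77, 0.85, 0.79, 0.83, 0.80]
enc_b = [0.78, 0.75, 0.80, 0.78, 0.79, 0.77]
low, high = paired_bootstrap_ci(enc_a, enc_b)
```

If the interval excludes zero, the difference is unlikely to be an artifact of which videos landed in the test set. Resampling whole videos rather than frames respects the strong correlation between frames of the same surgery.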

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our controlled empirical study. We address the major comment point-by-point below and will revise the manuscript to improve transparency on the experimental pipeline.

point-by-point responses
  1. Referee: The claim that the setup 'isolates representation quality' rests on pairing every encoder with identical MS-TCN++ weights, hyperparameters, and cached features. However, the manuscript does not describe any feature normalization, dimensionality projection, or per-encoder adaptation to address mismatches in output dimension and channel statistics (e.g., ResNet-50 vs. ViT-7B). Because MS-TCN++ uses dilated convolutions sensitive to input statistics, performance differences could arise from better statistical alignment with the fixed temporal head rather than intrinsic representation power; this is load-bearing for attributing gains specifically to foundation models.

    Authors: We agree that the manuscript should have explicitly described the feature handling steps to support the isolation claim. In the implemented pipeline, raw encoder outputs were cached and passed directly to MS-TCN++ without per-encoder normalization, projection layers, or adaptation; the temporal model’s first convolutional layer accommodates varying input channel dimensions. We acknowledge that this leaves open the possibility that some performance differences arise from statistical alignment rather than representation quality alone. We will revise the Methods section to document the exact output dimensions of each encoder, confirm the absence of additional preprocessing, and temper the phrasing around 'isolating representation quality' to reflect the fixed-head controlled setup more precisely. If space permits, we will also note that a common-dimension projection ablation could be explored in follow-up work. These changes will make the attribution of gains more defensible without altering the core empirical findings. revision: yes
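The ablation the rebuttal defers to follow-up work could look roughly like the sketch below. Per the rebuttal, this is not what the authors did: their pipeline passes raw features directly to MS-TCN++. The sketch z-scores each channel across frames and then maps every encoder's features to a shared width with a fixed Gaussian random projection, used here as a simple stand-in for a learned linear projection. All names, dimensions, and toy features are assumptions.

```python
import math
import random


def standardize(features, eps=1e-8):
    """Z-score each channel across frames: zero mean, unit variance."""
    n, dim = len(features), len(features[0])
    means = [sum(row[c] for row in features) / n for c in range(dim)]
    stds = [
        math.sqrt(sum((row[c] - means[c]) ** 2 for row in features) / n)
        for c in range(dim)
    ]
    return [
        [(row[c] - means[c]) / (stds[c] + eps) for c in range(dim)]
        for row in features
    ]


def random_projection(features, out_dim, seed=0):
    """Map features to `out_dim` channels with a fixed Gaussian matrix."""
    in_dim = len(features[0])
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]
    return [
        [sum(row[i] * weights[i][j] for i in range(in_dim)) for j in range(out_dim)]
        for row in features
    ]


def toy_features(n_frames, dim, scale, seed):
    """Made-up features standing in for one encoder's cached output."""
    rng = random.Random(seed)
    return [[scale * rng.random() for _ in range(dim)] for _ in range(n_frames)]


small = toy_features(20, 6, scale=1.0, seed=1)    # low-dimensional, small values
large = toy_features(20, 12, scale=50.0, seed=2)  # high-dimensional, large values
aligned = {
    name: random_projection(standardize(feats), out_dim=4)
    for name, feats in (("small", small), ("large", large))
}
shapes = {name: (len(f), len(f[0])) for name, f in aligned.items()}
```

After this step both feature sets have the same width and comparable channel statistics, so any remaining performance gap is harder to attribute to scale or dimensionality alone. In a real experiment the means and standard deviations would be estimated on the training split only.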

Circularity Check

0 steps flagged

No circularity: empirical comparison of external pre-trained encoders

full rationale

The paper conducts a controlled empirical study by extracting features from various pre-trained visual encoders (ResNet-50, I3D, DINOv3, V-JEPA2) and feeding them into a fixed MS-TCN++ temporal model under identical training settings on the SICS-155 dataset. No mathematical derivations, predictions, or first-principles results are claimed; performance metrics (e.g., 83.4% accuracy for DINOv3 ViT-7B) arise directly from experimental evaluation of externally sourced models. The setup uses cached features to decouple encoding from temporal learning, with no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The analysis is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of transfer learning and the representativeness of the SICS-155 dataset; no new entities or ad-hoc parameters are introduced beyond the choice of MS-TCN++ architecture.

axioms (2)
  • domain assumption: Pre-trained foundation model features transfer meaningfully to surgical video without domain-specific fine-tuning of the encoder itself.
    Invoked when the authors freeze encoders and only train the temporal head.
  • domain assumption: The SICS-155 dataset with 19 phases is a fair proxy for real-world small-incision cataract surgery variability.
    Used to generalize the reported accuracy and edit scores.


