pith. sign in

arxiv: 2604.04571 · v1 · submitted 2026-04-06 · 💻 cs.CV

TAPE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis

Pith reviewed 2026-05-10 19:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords parameter-efficient fine-tuningfoundation modelsOCTOCTAretinal layer segmentationdomain adaptationmasked image modeling
0
0 comments X

The pith

TAPE decouples domain alignment from task fitting to adapt foundation models for OCT-OCTA segmentation with few parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TAPE as a two-stage parameter-efficient fine-tuning framework to transfer foundation models to automated analysis of OCT and OCTA images. It first aligns the model to the medical imaging domain using masked image modeling with PEFT, then fits the model to retinal layer segmentation. This separation targets domain shift and task misalignment that block direct use of large pretrained models in clinical eye imaging. A reader would care because the approach promises high performance on diverse pathologies while updating far fewer parameters than full fine-tuning, easing deployment where data and compute are limited.

Core claim

TAPE strategically decouples adaptation into domain alignment and task fitting for downstream segmentation. The domain adaptation stage applies parameter-efficient fine-tuning in the context of masked image modeling for medical image domain adaptation. Applying TAPE to retinal layer segmentation on both universal (MAE) and specialized (RETFound) foundation models demonstrates superior parameter efficiency and state-of-the-art generalization performance across diverse pathologies.

What carries the argument

Two-stage decoupling via parameter-efficient fine-tuning, with the first stage performing domain alignment through masked image modeling and the second stage handling task-specific fitting for segmentation.

If this is right

  • Foundation models can be adapted to OCT-OCTA retinal segmentation while updating only a small fraction of parameters.
  • The method achieves state-of-the-art generalization across multiple eye pathologies.
  • The same two-stage structure works for both general-purpose and retina-specialized foundation models.
  • Masked image modeling with PEFT serves as an effective first-stage domain aligner for medical images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling idea could be tried on other medical imaging tasks such as lesion detection or disease classification.
  • It points to a possible way to lower the data and compute barriers for deploying foundation models in additional clinical specialties facing domain gaps.
  • Further tests on multi-center OCT datasets would check whether the efficiency gains hold when scanner types and patient populations vary more widely.

Load-bearing premise

Decoupling domain alignment via PEFT in masked image modeling from later task fitting will overcome domain shift and task misalignment without adding performance losses or needing heavy validation on new data distributions.

What would settle it

A head-to-head test on an unseen pathology dataset where the TAPE-adapted model matches or underperforms a single-stage PEFT baseline while using similar or greater numbers of updated parameters.

read the original abstract

Automated analysis of optical coherence tomography (OCT) and OCT angiography (OCTA) images is critical for robust ophthalmic diagnosis. Existing mainstream methods trained from scratch rely heavily on massive data and model scale, thereby hindering their practical deployment in resource-constrained clinical settings. Although transfer learning based on foundation models (FMs) is promising, it still faces significant challenges: domain shift and task misalignment. To address these, we propose TAPE: A Two-stage Adaptation Framework via Parameter-Efficient Fine-tuning, which strategically decouples adaptation into domain alignment and task fitting for downstream segmentation. The domain adaptation stage notably applies parameter-efficient fine-tuning (PEFT) in the context of masked image modeling for medical image domain adaptation, a novel approach to the best of our knowledge. Applying TAPE to retinal layer segmentation on both universal (masked auto-encoder, MAE) and specialized (RETFound) FMs, it demonstrates superior parameter efficiency and achieves state-of-the-art generalization performance across diverse pathologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TAPE, a two-stage parameter-efficient adaptation framework for foundation models in OCT and OCTA image analysis. It decouples the adaptation process into a domain alignment stage, which uses parameter-efficient fine-tuning (PEFT) within a masked image modeling paradigm, and a subsequent task fitting stage for downstream tasks such as retinal layer segmentation. Experiments on universal (MAE) and specialized (RETFound) foundation models demonstrate improved parameter efficiency and state-of-the-art generalization performance across various pathologies.

Significance. If the results hold, this work could have significant impact on deploying foundation models in resource-limited clinical environments by reducing the need for extensive retraining while handling domain shifts common in medical imaging. The approach of applying PEFT to masked image modeling for domain adaptation appears novel and could inspire similar strategies in other modalities. Credit is due for evaluating on both general and domain-specific FMs. However, the overall significance depends on confirming that the two-stage design provides additive benefits beyond standard PEFT applications.

major comments (2)
  1. Section 4 (Experimental Results): No ablation study isolates the contribution of the domain alignment stage. Specifically, there is no comparison of TAPE against a single-stage PEFT baseline where the foundation model is directly adapted to the segmentation task using the same PEFT modules and hyperparameters. This is a load-bearing issue for the central claim that the decoupling 'strategically' addresses domain shift and task misalignment, as any reported improvements could be attributable to other factors such as the choice of PEFT method or training protocol.
  2. §3.2 Domain Alignment Stage: The description of how PEFT is integrated into masked image modeling for domain adaptation lacks details on the specific PEFT technique (e.g., LoRA, Adapter) and the masking strategy tailored for OCT-OCTA. Without these, reproducibility is hindered, and it is unclear how this stage differs from standard self-supervised adaptation.
minor comments (2)
  1. The abstract claims 'superior parameter efficiency' but does not quantify the number of trainable parameters or compare FLOPs; including these metrics in the abstract or a summary table would strengthen the presentation.
  2. Table 1: Ensure that all baselines are fairly implemented with the same data splits and augmentation strategies as TAPE to avoid confounding factors in the SOTA claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful and constructive comments on our manuscript. We address each major comment below with clarifications and will revise the paper to incorporate additional experiments and details as outlined.

read point-by-point responses
  1. Referee: Section 4 (Experimental Results): No ablation study isolates the contribution of the domain alignment stage. Specifically, there is no comparison of TAPE against a single-stage PEFT baseline where the foundation model is directly adapted to the segmentation task using the same PEFT modules and hyperparameters. This is a load-bearing issue for the central claim that the decoupling 'strategically' addresses domain shift and task misalignment, as any reported improvements could be attributable to other factors such as the choice of PEFT method or training protocol.

    Authors: We agree that an ablation isolating the domain alignment stage is necessary to substantiate the benefits of decoupling. In the revised manuscript, we will add a direct comparison of TAPE against a single-stage PEFT baseline that applies identical PEFT modules and hyperparameters to the segmentation task without the domain alignment stage. This will report segmentation metrics, parameter efficiency, and generalization results across datasets and both foundation models (MAE and RETFound), allowing readers to assess the additive value of the two-stage design. revision: yes

  2. Referee: §3.2 Domain Alignment Stage: The description of how PEFT is integrated into masked image modeling for domain adaptation lacks details on the specific PEFT technique (e.g., LoRA, Adapter) and the masking strategy tailored for OCT-OCTA. Without these, reproducibility is hindered, and it is unclear how this stage differs from standard self-supervised adaptation.

    Authors: We thank the referee for highlighting the need for greater specificity. In the revised Section 3.2, we will detail the exact PEFT technique used (including type, rank, scaling factors, and adapted modules), the masking ratio, and any OCT-OCTA-specific adaptations to the masking strategy that emphasize domain-relevant structures. We will also clarify distinctions from standard self-supervised adaptation by explaining how PEFT enables efficient domain alignment prior to task fitting, thereby supporting reproducibility and the claimed novelty. revision: yes

Circularity Check

0 steps flagged

No circularity: methodological proposal with empirical claims only

full rationale

The paper introduces TAPE as a design choice (two-stage decoupling of domain alignment via PEFT in masked image modeling followed by task fitting) and reports empirical results on MAE and RETFound for retinal layer segmentation. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described framework. Claims of parameter efficiency and SOTA generalization are presented as outcomes of applying the framework, not as quantities forced by construction from the inputs. The proposal is self-contained as an engineering method without load-bearing reductions to self-definition or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on specific free parameters, axioms, or invented entities; the framework description does not detail any fitted values, unproved assumptions, or new postulated components.

pith-pipeline@v0.9.0 · 5482 in / 1188 out tokens · 41987 ms · 2026-05-10T19:14:09.769809+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

  1. [1]

    TAPE: A two-stage parameter-efficient adaptation framework for foundation models in OCT-OCTA analysis

    INTRODUCTION Optical coherence tomography (OCT) and OCT angiography (OCTA) are two important fundus imaging modalities. The former reflects clear anatomical information of the retinal layer, while the latter contains blood flow function informa- tion closely related to the occurrence of diseases. In clinical practice, OCT and OCTA are often used to identi...

  2. [2]

    We propose a novelTwo-stageAdaptation Framework via Parameter-Efficient Fine-tuning (TAPE), specifically en- gineered to tackle the dual challenges of domain shift and task misalignment commonly encountered when transfer- ring foundation models

  3. [3]

    This work provides an early and com- prehensive validation of the efficacy of parameter-efficient fine-tuning (PEFT) methods within MIM for medical im- age analysis

    In the domain adaptation stage, we systematically com- pare the performance of various fine-tuning methods in the context of SSL. This work provides an early and com- prehensive validation of the efficacy of parameter-efficient fine-tuning (PEFT) methods within MIM for medical im- age analysis

  4. [4]

    Utilizing OCT-OCTA retinal layer segmentation as the downstream task in the task adaptation stage, we con- duct extensive experiments on both the MAE and RET- Found. Quantitative and qualitative results demonstrate that our framework significantly reduces computational resource dependency while achieving state-of-the-art layer segmentation performance, pa...

  5. [5]

    METHOD 2.1. Preliminaries: FMs and fine-tuning methods We examine two distinct FMs for comparison: MAE, a self- supervised ViT pre-trained on natural images (ImageNet- 1K), and RETFound, an ophthalmic domain-specific FM pre-trained on0.74million OCT scans. Fine-tuning methods broadly include full parameter fine- tuning (FFT) and PEFT. Given a pre-trained ...

  6. [6]

    Experimental setup We conducted our experiments using the OCTA-500 dataset

    EXPERIMENTS 3.1. Experimental setup We conducted our experiments using the OCTA-500 dataset. The subjects, encompassing four categories (AMD, DR, RVO, and NORMAL), were strictly and stratifiedly split into training, validation, and test datasets. All experiments were exclusively performed on a single NVIDIA H100 GPU. 3.2. Experiment I: PEFT Strategy Selec...

  7. [7]

    For future work, we plan to extend TAPE along three directions

    CONCLUSION We proposed TAPE, a two-stage parameter-efficient adapta- tion framework that effectively integrates domain adaptation with task adaptation. For future work, we plan to extend TAPE along three directions. First, we will incorporate a wider range of OCT-OCTA analysis tasks, such as disease classification and OCT fluid segmentation. Second, we wi...

  8. [8]

    ACKNOWLEDGEMENTS This work is jointly supported by Open Fund of Nankai Uni- versity Optometry & Vision Science Institute (NKSGY2024 04), Shenzhen Science and Technology Program (JCYJ20240 813165501003), Science and Technology Program of Tianjin (23JCYBJC01240)

  9. [9]

    Eth- ical approval was not required as confirmed by the license attached with the open access data

    COMPLIANCE WITH ETHICAL STANDARDS This research study was conducted retrospectively using hu- man subject data made available in open access by [15]. Eth- ical approval was not required as confirmed by the license attached with the open access data

  10. [10]

    Optical coherence tomography angiography,

    Richard F Spaide, James G Fujimoto, Nadia K Waheed, Srinivas R Sadda, and Giovanni Staurenghi, “Optical coherence tomography angiography,”Progress in reti- nal and eye research, vol. 64, pp. 1–55, 2018

  11. [11]

    Oct-octa segmentation: combin- ing structural and blood flow information to segment bruch’s membrane,

    Julia Schottenhamml, Eric M Moult, Stefan B Ploner, Siyu Chen, Eduardo Novais, Lennart Husvogt, Jay S Duker, Nadia K Waheed, James G Fujimoto, and An- dreas K Maier, “Oct-octa segmentation: combin- ing structural and blood flow information to segment bruch’s membrane,”Biomedical Optics Express, vol. 12, no. 1, pp. 84–99, 2020

  12. [12]

    Relaynet: retinal layer and fluid segmentation of macular optical coherence to- mography using fully convolutional networks,

    Abhijit Guha Roy, Sailesh Conjeti, Sri Phani Kr- ishna Karri, Debdoot Sheet, Amin Katouzian, Christian Wachinger, and Nassir Navab, “Relaynet: retinal layer and fluid segmentation of macular optical coherence to- mography using fully convolutional networks,”Biomed- ical optics express, vol. 8, no. 8, pp. 3627–3642, 2017

  13. [13]

    Multi-scale gcn-assisted two- stage network for joint segmentation of retinal layers and discs in peripapillary oct images,

    Jiaxuan Li, Peiyao Jin, Jianfeng Zhu, Haidong Zou, Xun Xu, Min Tang, Minwen Zhou, Yu Gan, Jiangnan He, Yuye Ling, et al., “Multi-scale gcn-assisted two- stage network for joint segmentation of retinal layers and discs in peripapillary oct images,”Biomedical Op- tics Express, vol. 12, no. 4, pp. 2204–2220, 2021

  14. [14]

    Exploiting multi-granularity visual features for retinal layer seg- mentation in human eyes,

    Xiang He, Yiming Wang, Fabio Poiesi, Weiye Song, Quanqing Xu, Zixuan Feng, and Yi Wan, “Exploiting multi-granularity visual features for retinal layer seg- mentation in human eyes,”Frontiers in Bioengineering and Biotechnology, vol. 11, pp. 1191803, 2023

  15. [15]

    Retinal layer segmentation in oct images with boundary regression and feature polarization,

    Yubo Tan, Wen-Da Shen, Ming-Yuan Wu, Gui-Na Liu, Shi-Xuan Zhao, Yang Chen, Kai-Fu Yang, and Yong- Jie Li, “Retinal layer segmentation in oct images with boundary regression and feature polarization,”IEEE Transactions on Medical Imaging, vol. 43, no. 2, pp. 686–700, 2023

  16. [16]

    Msapnet: Multi-scale and multi- axial perception network for retinal layers and lesion segmentation,

    Wentao Yu, Chenggang Lu, Quanyong Yi, Jiong Zhang, and Caifeng Shan, “Msapnet: Multi-scale and multi- axial perception network for retinal layers and lesion segmentation,” in2025 IEEE 22nd International Sym- posium on Biomedical Imaging (ISBI). IEEE, 2025, pp. 1–5

  17. [17]

    Masked autoencoders are scalable vision learners,

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Pi- otr Doll ´ar, and Ross Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009

  18. [18]

    A foundation model for generalizable disease detection from retinal images,

    Yukun Zhou, Mark A Chia, Siegfried K Wagner, Murat S Ayhan, Dominic J Williamson, Robbert R Struyven, Timing Liu, Moucheng Xu, Mateo G Lozano, Peter Woodward-Court, et al., “A foundation model for generalizable disease detection from retinal images,” Nature, vol. 622, no. 7981, pp. 156–163, 2023

  19. [19]

    How foundational is the retina foundation model? estimating retfound’s label ef- ficiency on binary classification of normal versus abnor- mal oct images,

    David Kuo, Qitong Gao, Dev Patel, Miroslav Pajic, and Majda Hadziahmetovic, “How foundational is the retina foundation model? estimating retfound’s label ef- ficiency on binary classification of normal versus abnor- mal oct images,”Ophthalmology Science, vol. 5, no. 3, pp. 100707, 2025

  20. [20]

    FunduSegmenter: Leveraging the RETFound Foundation Model for Joint Optic Disc and Optic Cup Segmentation in Retinal Fundus Images

    Zhenyi Zhao, Muthu Rama Krishnan Mookiah, and Emanuele Trucco, “Leveraging the retfound foundation model for optic disc segmentation in retinal images,” arXiv preprint arXiv:2508.11354, 2025

  21. [21]

    Lora: Low-rank adaptation of large lan- guage models.,

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al., “Lora: Low-rank adaptation of large lan- guage models.,”ICLR, vol. 1, no. 2, pp. 3, 2022

  22. [22]

    Vision transformer adapter for dense predictions,

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao, “Vision transformer adapter for dense predictions,” inThe Eleventh Interna- tional Conference on Learning Representations, 2023

  23. [23]

    Visual prompt tuning,

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser- Nam Lim, “Visual prompt tuning,” inEuropean confer- ence on computer vision. Springer, 2022, pp. 709–727

  24. [24]

    Octa-500: A retinal dataset for optical coherence tomography angiography study,

    Mingchao Li, Kun Huang, Qiuzhuo Xu, Jiadong Yang, Yuhan Zhang, Zexuan Ji, Keren Xie, Songtao Yuan, Qinghuai Liu, and Qiang Chen, “Octa-500: A retinal dataset for optical coherence tomography angiography study,”Medical Image Analysis, vol. 93, pp. 103092, 2024