pith. sign in

arxiv: 2605.19340 · v1 · pith:JAZ73U6Jnew · submitted 2026-05-19 · 💻 cs.CV

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

Pith reviewed 2026-05-20 06:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-domain few-shot segmentationvision foundation modelslayer selectionregularizationcalibrationdomain adaptationsemantic segmentationfew-shot learning
0
0 comments X

The pith

HERA adapts vision foundation models for cross-domain few-shot segmentation by selecting the best layer, regularizing interactions, and calibrating predictions with minimal parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hierarchical Exemplar Representation Adaptation (HERA) to tackle the challenges of limited labeled exemplars and domain shifts in cross-domain few-shot semantic segmentation. It uses a three-stage process: selecting the most informative layer from a vision foundation model based on Exemplar Transfer Risk, regularizing the representation with prior guidance, and calibrating pixel-wise predictions. This allows the method to surpass existing approaches by over 4.1 mIoU on benchmarks while updating less than 2.7 percent of the model's parameters at test time. A reader would care because it makes large pre-trained vision models practical for specialized tasks in new domains without extensive retraining.

Core claim

HERA is a select-regularize-calibrate framework that first identifies the most informative VFM layer using a data-dependent Exemplar Transfer Risk metric computed for each candidate layer, then applies Prior-Guided Regularization to yield well-structured local signals, and finally uses Pixelwise Adaptive Calibration to combine the representation with refined maps for consistent masks, all while keeping the VFM frozen and fine-tuning only a small fraction of parameters.

What carries the argument

The Hierarchical Layer Selection (HLS) using Exemplar Transfer Risk (ETR) to pick the best layer, combined with Prior-Guided Regularization (PGR) and Pixelwise Adaptive Calibration (PAC) in the HERA pipeline.

If this is right

  • Adaptive layer selection avoids overfitting to small numbers of labeled exemplars.
  • Regularization produces structured local signals that improve subsequent calibration.
  • Calibration ensures consistent masks across the target domain.
  • Overall performance exceeds state of the art by more than 4.1 mIoU on multiple benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar select-regularize-calibrate strategies could be applied to other vision tasks like object detection or instance segmentation in cross-domain settings.
  • The reliance on a single selected layer suggests that not all layers in foundation models are equally useful for novel domains, which might guide future model design.
  • Extending the ETR metric to evaluate combinations of layers rather than single ones could further improve adaptation.

Load-bearing premise

The data-dependent Exemplar Transfer Risk metric can accurately identify the single most informative layer from the vision foundation model despite the small number of target-domain exemplars and potential domain-specific noise.

What would settle it

Running the benchmarks with random layer selection instead of ETR-based selection and observing whether the mIoU gain disappears or reverses.

Figures

Figures reproduced from arXiv: 2605.19340 by Junyuan Ma, Qi Fan, Wenbin Li, Xunzhi Xiang, Yang Gao.

Figure 1
Figure 1. Figure 1: Scarce labels and target domain shift co-occur in CD [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: HERA architecture. Hierarchical Layer Selection (HLS) estimates the leave-one-out layer risk [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Layerwise variability in VFM features (DINOv3 example). Foreground-logit heatmaps from ViT layers 0-23 for two episodes [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Prior Guided Regularization (PGR). Per-head Gaussian [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results on the Chest X-ray, ISIC, FSS-1000, and Deepglobe datasets under the 1-shot setting. The prediction and [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Additional layer-wise probability maps demonstrating [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Layer-wise predicted foreground probability maps [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 10
Figure 10. Figure 10: Layer-wise probability visualisation for another 1-shot [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
read the original abstract

Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate framework for cross-domain few-shot semantic segmentation (CD-FSS) that adapts frozen vision foundation models (VFMs) using only 1-5 labeled exemplars per novel class. The stages are: (1) Hierarchical Layer Selection (HLS) that computes a data-dependent Exemplar Transfer Risk (ETR) to identify the single most informative VFM layer; (2) Prior-Guided Regularization (PGR) that regularizes interactions on the selected representation; and (3) Pixelwise Adaptive Calibration (PAC) that combines the representation with refined interaction maps for final masks. The method fine-tunes <2.7% of parameters at test time and reports >4.1 mIoU gains over prior SOTA across multiple CD-FSS benchmarks.

Significance. If the results hold under rigorous validation, the work would be a meaningful engineering contribution to parameter-efficient VFM adaptation for CD-FSS. The hierarchical pipeline directly targets the dual challenges of overfitting on tiny support sets and layer-wise sensitivity to domain shift without requiring source-domain retraining. The reported parameter efficiency (<2.7%) and the explicit design of ETR, PGR, and PAC as modular stages are strengths that could influence follow-up work on selective feature use in foundation models.

major comments (2)
  1. §3.1 (Hierarchical Layer Selection): The headline performance claim depends on ETR computed solely on the 1-5 support exemplars reliably identifying the layer with best transfer to novel classes under domain shift. With such small sample sizes, ETR is vulnerable to exemplar-specific noise or spurious correlations; the manuscript must report whether ETR layer rankings remain stable under support-set resampling (e.g., bootstrap or leave-one-out) and whether they correlate with oracle layer performance measured on held-out target-domain validation data.
  2. Experimental evaluation (throughout §4): The abstract asserts >4.1 mIoU gains and <2.7% parameter fine-tuning, yet the provided description contains no details on experimental protocol, number of runs, statistical significance, variance across random support sets, or full ablation tables isolating the contribution of HLS, PGR, and PAC. These omissions leave the central empirical claim only partially supported and require addition of standard few-shot reporting practices (e.g., mean±std over 5-10 seeds, ablation on ETR vs. fixed-layer baselines).
minor comments (2)
  1. Notation: Define the precise mathematical form of ETR (including any hyperparameters) in the main text rather than deferring entirely to supplementary material, as it is load-bearing for reproducibility.
  2. Figure 1 (pipeline diagram): Add explicit arrows or labels showing how the output of HLS feeds into PGR and then PAC to clarify the hierarchical flow for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We have addressed each major point below and revised the manuscript to incorporate additional analyses and reporting as requested.

read point-by-point responses
  1. Referee: [—] §3.1 (Hierarchical Layer Selection): The headline performance claim depends on ETR computed solely on the 1-5 support exemplars reliably identifying the layer with best transfer to novel classes under domain shift. With such small sample sizes, ETR is vulnerable to exemplar-specific noise or spurious correlations; the manuscript must report whether ETR layer rankings remain stable under support-set resampling (e.g., bootstrap or leave-one-out) and whether they correlate with oracle layer performance measured on held-out target-domain validation data.

    Authors: We agree that assessing the stability of ETR with small support sets is a valid concern. In the revised manuscript, we have added a new analysis in §3.1 using bootstrap resampling (100 iterations) and leave-one-out validation across the support exemplars on all benchmarks. The results indicate that the top-ranked layer by ETR remains consistent in over 75% of resamples, with a Pearson correlation of 0.72 to the oracle best-performing layer evaluated on held-out target-domain validation splits. These findings are reported in a new paragraph and Table S2 of the supplementary material. revision: yes

  2. Referee: [—] Experimental evaluation (throughout §4): The abstract asserts >4.1 mIoU gains and <2.7% parameter fine-tuning, yet the provided description contains no details on experimental protocol, number of runs, statistical significance, variance across random support sets, or full ablation tables isolating the contribution of HLS, PGR, and PAC. These omissions leave the central empirical claim only partially supported and require addition of standard few-shot reporting practices (e.g., mean±std over 5-10 seeds, ablation on ETR vs. fixed-layer baselines).

    Authors: We acknowledge the need for more rigorous experimental reporting. The revised manuscript now includes: a detailed protocol description in §4.1 specifying 5 random support-set seeds per novel class; mean±std results over these runs for all methods; paired t-test p-values confirming statistical significance of the >4.1 mIoU gains; and expanded ablation tables (Tables 3 and 4) that isolate HLS (including ETR vs. fixed-layer baselines), PGR, and PAC contributions. Variance across random support sets is now explicitly shown in all main tables. revision: yes

Circularity Check

0 steps flagged

No circularity: independent engineering framework with no self-referential derivations

full rationale

The paper describes HERA as a three-stage select-regularize-calibrate pipeline (HLS via data-dependent ETR, PGR, PAC) for adapting frozen VFMs to CD-FSS. No equations, derivations, or first-principles results are shown that reduce any claimed quantity to a fitted parameter or self-defined input by construction. Performance claims rest on experimental benchmarks rather than mathematical reductions. The method is presented as an applied contribution that fine-tunes <2.7% parameters, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that pre-trained VFM layers contain transferable information for novel classes under domain shift and that the introduced selection and calibration stages can extract it effectively from few exemplars.

axioms (1)
  • domain assumption Vision foundation models pre-trained on large-scale data contain layers whose features remain useful for novel classes even after domain shifts.
    This underpins the decision to select rather than fully retrain the VFM.

pith-pipeline@v0.9.0 · 5843 in / 1324 out tokens · 44770 ms · 2026-05-20T06:41:50.845176+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 5 internal anchors

  1. [1]

    Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, and Jose Dolz. Few-shot seg- mentation without meta-learning: A good transductive infer- ence is all you need? InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13979–13988, 2021. 4, 6

  2. [2]

    Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration.IEEE transac- tions on medical imaging, 33(2):577–590, 2013

    Sema Candemir, Stefan Jaeger, Kannappan Palaniappan, Jonathan P Musco, Rahul K Singh, Zhiyun Xue, Alexandros Karargyris, Sameer Antani, George Thoma, and Clement J McDonald. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration.IEEE transac- tions on medical imaging, 33(2):577–590, 2013. 6, 1

  3. [3]

    Pixel matching network for cross-domain few- shot segmentation

    Hao Chen, Yonghan Dong, Zheming Lu, Yunlong Yu, and Jungong Han. Pixel matching network for cross-domain few- shot segmentation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 978– 987, 2024. 6

  4. [4]

    Cross-domain few-shot semantic segmentation via doubly matching transformation

    Jiayi Chen, Rong Quan, and Jie Qin. Cross-domain few-shot semantic segmentation via doubly matching transformation. arXiv preprint arXiv:2405.15265, 2024. 2

  5. [5]

    Adaptformer: Adapting vision transformers for scalable visual recogni- tion.Advances in Neural Information Processing Systems, 35:16664–16678, 2022

    Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recogni- tion.Advances in Neural Information Processing Systems, 35:16664–16678, 2022. 3

  6. [6]

    Vision trans- former adapter for dense predictions,

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions.arXiv preprint arXiv:2205.08534, 2022. 3

  7. [7]

    Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 2

  8. [8]

    Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

    Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the interna- tional skin imaging collaboration (isic).arXiv preprint arXiv:1902.03368, 2019. 6, 1, 2

  9. [9]

    Deepglobe 2018: A challenge to parse the earth through satellite images

    Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 172–181, 2018. 6, 1, 2

  10. [10]

    Few-shot semantic segmen- tation with prototype learning

    Nanqing Dong and Eric P Xing. Few-shot semantic segmen- tation with prototype learning. InBMVC, page 4, 2018. 2

  11. [11]

    The pascal visual object classes (voc) challenge.International journal of computer vision, 88(2):303–338, 2010

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88(2):303–338, 2010. 6

  12. [12]

    Self- support few-shot semantic segmentation

    Qi Fan, Wenjie Pei, Yu-Wing Tai, and Chi-Keung Tang. Self- support few-shot semantic segmentation. InEuropean con- ference on computer vision, pages 701–719. Springer, 2022. 1, 2, 6

  13. [13]

    Adapt- ing in-domain few-shot segmentation to new domains with- out retraining.arXiv preprint arXiv:2504.21414, 2025

    Qi Fan, Kaiqi Liu, Nian Liu, Hisham Cholakkal, Rao Muhammad Anwer, Wenbin Li, and Yang Gao. Adapt- ing in-domain few-shot segmentation to new domains with- out retraining.arXiv preprint arXiv:2504.21414, 2025. 2

  14. [14]

    Eva: Exploring the limits of masked visual representa- tion learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representa- tion learning at scale. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 19358–19369, 2023. 2

  15. [15]

    Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

  16. [16]

    Dat- acomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Sys- tems, 36:27092–27112, 2023

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Dat- acomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Sys- tems, 36:27092–27112, 2023. 6

  17. [17]

    Note: Robust continual test- time adaptation against temporal correlation.Advances in Neural Information Processing Systems, 35:27253–27266,

    Taesik Gong, Jongheon Jeong, Taewon Kim, Yewon Kim, Jinwoo Shin, and Sung-Ju Lee. Note: Robust continual test- time adaptation against temporal correlation.Advances in Neural Information Processing Systems, 35:27253–27266,

  18. [18]

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey.arXiv preprint arXiv:2403.14608,

  19. [19]

    Momentum contrast for unsupervised visual rep- resentation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9729–9738, 2020. 4

  20. [20]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 2

  21. [21]

    Apseg: Auto-prompt network for cross-domain few-shot semantic segmentation

    Weizhao He, Yang Zhang, Wei Zhuo, Linlin Shen, Jiaqi Yang, Songhe Deng, and Liang Sun. Apseg: Auto-prompt network for cross-domain few-shot semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23762–23772, 2024. 2, 6

  22. [22]

    Adapt before comparison: A new perspective on cross-domain few-shot segmentation

    Jonas Herzog. Adapt before comparison: A new perspective on cross-domain few-shot segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23605–23615, 2024. 1, 2, 6

  23. [23]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 3

  24. [24]

    Automatic tuberculosis screening using chest radio- graphs.IEEE transactions on medical imaging, 33(2):233– 245, 2013

    Stefan Jaeger, Alexandros Karargyris, Sema Candemir, Les Folio, Jenifer Siegelman, Fiona Callaghan, Zhiyun Xue, Kannappan Palaniappan, Rahul K Singh, Sameer Antani, et al. Automatic tuberculosis screening using chest radio- graphs.IEEE transactions on medical imaging, 33(2):233– 245, 2013. 6, 1

  25. [25]

    Tinytta: Efficient test-time adaptation via early-exit ensembles on edge de- vices.Advances in Neural Information Processing Systems, 37:43274–43299, 2024

    Hong Jia, Young Kwon, Alessio Orsino, Ting Dang, Domenico Talia, and Cecilia Mascolo. Tinytta: Efficient test-time adaptation via early-exit ensembles on edge de- vices.Advances in Neural Information Processing Systems, 37:43274–43299, 2024. 3

  26. [26]

    Membn: Robust test-time adaptation via batch norm with statistics memory

    Juwon Kang, Nayeong Kim, Jungseul Ok, and Suha Kwak. Membn: Robust test-time adaptation via batch norm with statistics memory. InEuropean Conference on Computer Vi- sion, pages 467–483. Springer, 2024. 3

  27. [27]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2

  28. [28]

    Learning what not to segment: A new perspective on few- shot segmentation

    Chunbo Lang, Gong Cheng, Binfei Tu, and Junwei Han. Learning what not to segment: A new perspective on few- shot segmentation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8057–8067, 2022. 2

  29. [29]

    Base and meta: A new perspective on few-shot segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10669–10686, 2023

    Chunbo Lang, Gong Cheng, Binfei Tu, Chao Li, and Jun- wei Han. Base and meta: A new perspective on few-shot segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10669–10686, 2023. 2

  30. [30]

    Surgical fine-tuning improves adaptation to distribution shifts.arXiv preprint arXiv:2210.11466, 2022

    Yoonho Lee, Annie S Chen, Fahim Tajwar, Ananya Ku- mar, Huaxiu Yao, Percy Liang, and Chelsea Finn. Surgical fine-tuning improves adaptation to distribution shifts.arXiv preprint arXiv:2210.11466, 2022. 2, 5

  31. [31]

    Cross-domain few-shot se- mantic segmentation

    Shuo Lei, Xuchao Zhang, Jianfeng He, Fanglan Chen, Bowen Du, and Chang-Tien Lu. Cross-domain few-shot se- mantic segmentation. InEuropean conference on computer vision, pages 73–90. Springer, 2022. 2, 6

  32. [32]

    Adaptive prototype learning and allocation for few-shot segmentation

    Gen Li, Varun Jampani, Laura Sevilla-Lara, Deqing Sun, Jonghyun Kim, and Joongkyu Kim. Adaptive prototype learning and allocation for few-shot segmentation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8334–8343, 2021. 2

  33. [33]

    Fss-1000: A 1000-class dataset for few- shot segmentation

    Xiang Li, Tianhan Wei, Yau Pun Chen, Yu-Wing Tai, and Chi-Keung Tang. Fss-1000: A 1000-class dataset for few- shot segmentation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 2869–2878, 2020. 6, 1

  34. [34]

    Dual-agent optimization framework for cross- domain few-shot segmentation

    Zhaoyang Li, Yuan Wang, Wangkai Li, Tianzhu Zhang, and Xiang Liu. Dual-agent optimization framework for cross- domain few-shot segmentation. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 9849–9859, 2025. 6, 7

  35. [35]

    A comprehensive sur- vey on test-time adaptation under distribution shifts.Interna- tional Journal of Computer Vision, 133(1):31–64, 2025

    Jian Liang, Ran He, and Tieniu Tan. A comprehensive sur- vey on test-time adaptation under distribution shifts.Interna- tional Journal of Computer Vision, 133(1):31–64, 2025. 2, 3

  36. [36]

    Textual and visual guided task adaptation for source-free cross-domain few-shot segmentation

    Jianming Liu, Wenlong Qiu, and Haitao Wei. Textual and visual guided task adaptation for source-free cross-domain few-shot segmentation. InProceedings of the 33rd ACM International Conference on Multimedia, pages 5150–5159,

  37. [37]

    The devil is in low-level features for cross-domain few-shot seg- mentation

    Yuhan Liu, Yixiong Zou, Yuhua Li, and Ruixuan Li. The devil is in low-level features for cross-domain few-shot seg- mentation. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 4618–4627, 2025. 2, 6, 7

  38. [38]

    Simpler is better: Few-shot semantic seg- mentation with classifier weight transformer

    Zhihe Lu, Sen He, Xiatian Zhu, Li Zhang, Yi-Zhe Song, and Tao Xiang. Simpler is better: Few-shot semantic seg- mentation with classifier weight transformer. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 8741–8750, 2021. 2

  39. [39]

    Hypercorrela- tion squeeze for few-shot segmentation

    Juhong Min, Dahyun Kang, and Minsu Cho. Hypercorrela- tion squeeze for few-shot segmentation. InProceedings of the IEEE/CVF international conference on computer vision, pages 6941–6952, 2021. 1, 2, 6

  40. [40]

    Cross-domain few-shot segmentation via iterative support-query correspon- dence mining

    Jiahao Nie, Yun Xing, Gongjie Zhang, Pei Yan, Aoran Xiao, Yap-Peng Tan, Alex C Kot, and Shijian Lu. Cross-domain few-shot segmentation via iterative support-query correspon- dence mining. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3380– 3390, 2024. 1, 2, 6

  41. [41]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 8, 6, 7

  42. [42]

    Hierarchical dense cor- relation distillation for few-shot segmentation

    Bohao Peng, Zhuotao Tian, Xiaoyang Wu, Chengyao Wang, Shu Liu, Jingyong Su, and Jiaya Jia. Hierarchical dense cor- relation distillation for few-shot segmentation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23641–23651, 2023. 2

  43. [43]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 6

  44. [44]

    Do vision trans- formers see like convolutional neural networks?Advances in neural information processing systems, 34:12116–12128,

    Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision trans- formers see like convolutional neural networks?Advances in neural information processing systems, 34:12116–12128,

  45. [45]

    Levi: generalizable fine-tuning via layer-wise ensemble of different views.arXiv preprint arXiv:2402.04644, 2024

    Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi, Lichan Hong, Ed H Chi, et al. Levi: generalizable fine-tuning via layer-wise ensemble of different views.arXiv preprint arXiv:2402.04644, 2024. 2, 5

  46. [46]

    DINOv3

    Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 3, 7, 6

  47. [47]

    Domain-rectifying adapter for cross-domain few-shot segmentation

    Jiapeng Su, Qi Fan, Wenjie Pei, Guangming Lu, and Fanglin Chen. Domain-rectifying adapter for cross-domain few-shot segmentation. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 24036– 24045, 2024. 2, 6

  48. [48]

    Prior guided feature enrich- ment network for few-shot segmentation.IEEE transactions on pattern analysis and machine intelligence, 44(2):1050– 1065, 2020

    Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrich- ment network for few-shot segmentation.IEEE transactions on pattern analysis and machine intelligence, 44(2):1050– 1065, 2020. 2, 6

  49. [49]

    Lightweight frequency masker for cross-domain few-shot se- mantic segmentation.Advances in Neural Information Pro- cessing Systems, 37:96728–96749, 2024

    Jintao Tong, Yixiong Zou, Yuhua Li, and Ruixuan Li. Lightweight frequency masker for cross-domain few-shot se- mantic segmentation.Advances in Neural Information Pro- cessing Systems, 37:96728–96749, 2024. 1, 2, 6

  50. [50]

    Adapter naturally serves as decoupler for cross-domain few-shot semantic segmentation.arXiv preprint arXiv:2506.07376, 2025

    Jintao Tong, Ran Ma, Yixiong Zou, Guangyao Chen, Yuhua Li, and Ruixuan Li. Adapter naturally serves as decoupler for cross-domain few-shot semantic segmentation.arXiv preprint arXiv:2506.07376, 2025. 6, 7

  51. [51]

    Self-disentanglement and re-composition for cross-domain few-shot segmentation.arXiv preprint arXiv:2506.02677, 2025

    Jintao Tong, Yixiong Zou, Guangyao Chen, Yuhua Li, and Ruixuan Li. Self-disentanglement and re-composition for cross-domain few-shot segmentation.arXiv preprint arXiv:2506.02677, 2025. 2, 6, 7

  52. [52]

    The ham10000 dataset, a large collection of multi-source der- matoscopic images of common pigmented skin lesions.Sci- entific data, 5(1):1–9, 2018

    Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source der- matoscopic images of common pigmented skin lesions.Sci- entific data, 5(1):1–9, 2018. 6, 1, 2

  53. [53]

    Tent: Fully Test-time Adaptation by Entropy Minimization

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Ol- shausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization.arXiv preprint arXiv:2006.10726,

  54. [54]

    Panet: Few-shot image semantic seg- mentation with prototype alignment

    Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic seg- mentation with prototype alignment. Inproceedings of the IEEE/CVF international conference on computer vision, pages 9197–9206, 2019. 1, 6

  55. [55]

    Continual test-time domain adaptation

    Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022. 3

  56. [56]

    Adap- tive agent transformer for few-shot segmentation

    Yuan Wang, Rui Sun, Zhe Zhang, and Tianzhu Zhang. Adap- tive agent transformer for few-shot segmentation. InEuro- pean conference on computer vision, pages 36–52. Springer,

  57. [57]

    A survey of efficient fine- tuning methods for vision-language models—prompt and adapter.Computers & Graphics, 119:103885, 2024

    Jialu Xing, Jianping Liu, Jian Wang, Lulu Sun, Xi Chen, Xunxun Gu, and Yingfei Wang. A survey of efficient fine- tuning methods for vision-language models—prompt and adapter.Computers & Graphics, 119:103885, 2024. 3

  58. [58]

    Prototype mixture models for few-shot semantic segmentation

    Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qix- iang Ye. Prototype mixture models for few-shot semantic segmentation. InEuropean conference on computer vision, pages 763–778. Springer, 2020. 6

  59. [59]

    Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation

    Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 9587–9595,

  60. [60]

    Canet: Class-agnostic segmentation networks with it- erative refinement and attentive few-shot learning

    Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. Canet: Class-agnostic segmentation networks with it- erative refinement and attentive few-shot learning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5217–5226, 2019. 6

  61. [61]

    Few-shot segmentation via cycle-consistent trans- former.Advances in neural information processing systems, 34:21984–21996, 2021

    Gengwei Zhang, Guoliang Kang, Yi Yang, and Yunchao Wei. Few-shot segmentation via cycle-consistent trans- former.Advances in neural information processing systems, 34:21984–21996, 2021. 1