SegTTA: Training-Free Test-Time Augmentation for Zero-Shot Medical Imaging Segmentation
Pith reviewed 2026-05-10 05:33 UTC · model grok-4.3
The pith
SegTTA applies four augmentations and weighted voting across MedSAM2 checkpoints to raise zero-shot medical segmentation accuracy without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SegTTA shows that four augmentations (Gamma correction, Contrast enhancement, Gaussian blur, Gaussian noise) plus weighted voting across MedSAM2 checkpoints improve zero-shot segmentation without any model retraining. On the multiclass hepatic vessel dataset the method raises mIoU by 1.6 and aIoU by 1.9 while lowering HD95 by roughly 2.0 relative to the MedSAM2 baseline. Ablation studies confirm that large organs gain from intensity-based augmentations and small lesions gain from noise-based ones, while the voting threshold directly controls the coverage-precision trade-off for clinical needs.
What carries the argument
The SegTTA framework that applies four fixed augmentations to each test image and aggregates predictions via weighted voting across multiple MedSAM2 checkpoints.
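As a rough illustration of this mechanism, the sketch below applies the four named transforms to one image and fuses the binary masks from several checkpoints by weighted voting. Everything specific in it is an assumption made for illustration and is not taken from the paper: the parameter values (`gamma=0.8`, `factor=1.5`, `std=0.05`), the 3-tap blur kernel, the inclusion of the unaugmented image as a fifth view, the per-checkpoint weights, and the names `segtta_predict` and `make_checkpoint`. Checkpoints are modeled as plain callables, and nested Python lists stand in for image arrays to keep the sketch dependency-free.

```python
import random

def clip01(x):
    return min(1.0, max(0.0, x))

def gamma_correction(img, gamma=0.8):
    # Power-law remapping of intensities in [0, 1] (gamma value is hypothetical).
    return [[p ** gamma for p in row] for row in img]

def contrast_enhancement(img, factor=1.5):
    # Stretch intensities around the image mean, then clip to [0, 1].
    mean = sum(sum(row) for row in img) / (len(img) * len(img[0]))
    return [[clip01(mean + factor * (p - mean)) for p in row] for row in img]

def gaussian_blur(img, kernel=(0.25, 0.5, 0.25)):
    # Separable 3-tap approximation of a Gaussian with edge replication.
    def blur_1d(line):
        n = len(line)
        return [kernel[0] * line[max(i - 1, 0)] + kernel[1] * line[i]
                + kernel[2] * line[min(i + 1, n - 1)] for i in range(n)]
    rows = [blur_1d(row) for row in img]
    cols = [blur_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def gaussian_noise(img, std=0.05, seed=0):
    # Additive zero-mean noise with a fixed seed so the sketch is repeatable.
    rng = random.Random(seed)
    return [[clip01(p + rng.gauss(0.0, std)) for p in row] for row in img]

# Assumption: the unaugmented image is voted on alongside the four views.
AUGMENTATIONS = [lambda im: im, gamma_correction, contrast_enhancement,
                 gaussian_blur, gaussian_noise]

def segtta_predict(img, checkpoints, weights, threshold=0.5):
    """Fuse masks from every (checkpoint, view) pair by weighted voting."""
    h, w = len(img), len(img[0])
    votes = [[0.0] * w for _ in range(h)]
    total = 0.0
    for predict, weight in zip(checkpoints, weights):
        for augment in AUGMENTATIONS:
            mask = predict(augment(img))
            total += weight
            for y in range(h):
                for x in range(w):
                    votes[y][x] += weight * mask[y][x]
    # A pixel is foreground when its weighted vote fraction reaches the threshold.
    return [[1 if v / total >= threshold else 0 for v in row] for row in votes]

def make_checkpoint(cutoff):
    # Toy stand-in for a segmentation checkpoint: a fixed intensity cut-off.
    return lambda im: [[1 if p > cutoff else 0 for p in row] for row in im]

image = [[0.1, 0.2, 0.9], [0.1, 0.8, 0.9], [0.1, 0.2, 0.3]]
fused = segtta_predict(image, [make_checkpoint(0.4), make_checkpoint(0.6)], [0.5, 0.5])
```

Because all four transforms leave pixel positions unchanged, the predicted masks can be voted on directly, with no inverse spatial warp before aggregation.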
If this is right
- Consistent accuracy gains appear across healthy uterus segmentation, uterine myoma detection, and multiclass hepatic structure segmentation.
- Intensity augmentations improve large-organ results while noise augmentations improve small-lesion results.
- Raising or lowering the voting threshold trades segmentation coverage for precision to match different clinical priorities.
- The same procedure reduces HD95 (the 95th-percentile Hausdorff distance) by about 2.0 on the hepatic vessel dataset.
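The threshold trade-off in the list above can be made concrete with a toy sweep. The sketch assumes that "coverage" means recall against the reference mask, which the source does not define, and the vote fractions and reference labels are invented for illustration.

```python
def coverage_and_precision(vote_fraction, reference, threshold):
    """Binarize per-pixel vote fractions and score them against a reference mask."""
    predicted = [v >= threshold for v in vote_fraction]
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    coverage = tp / (tp + fn) if tp + fn else 0.0   # recall (assumed meaning)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return coverage, precision

# Hypothetical flattened vote map: fraction of weighted votes per pixel.
vote_fraction = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
reference = [1, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    cov, prec = coverage_and_precision(vote_fraction, reference, threshold)
    print(f"threshold={threshold:.2f} coverage={cov:.2f} precision={prec:.2f}")
```

In this toy case the lowest threshold gives full coverage at 0.60 precision, and the highest gives 1.00 precision at 0.67 coverage. Coverage can only fall as the threshold rises, whereas precision usually rises but is not guaranteed to.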
Where Pith is reading between the lines
- The method may transfer to other promptable segmentation models if those models also provide multiple checkpoints with complementary errors.
- Fixed augmentation weights may underperform on datasets with different noise or contrast profiles, suggesting future work on lightweight per-task calibration.
- Because no retraining occurs, the framework could serve as a quick post-processing step when deploying foundation models in new hospitals.
Load-bearing premise
The specific four augmentations combined with weighted voting across MedSAM2 checkpoints will produce consistent gains on unseen medical images without introducing new errors or requiring task-specific tuning of the weights and threshold.
What would settle it
A held-out medical imaging dataset on which the four-augmentation plus weighted-voting procedure yields no gain or a loss in mIoU, aIoU, or HD95 compared with the plain MedSAM2 baseline.
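One way such a check could be scripted is sketched below for the IoU part only. It assumes flattened binary masks and a per-case comparison; the function names and the toy masks are hypothetical, and a real evaluation would also cover aIoU and HD95 and test whether the difference is statistically meaningful.

```python
def iou(pred, ref):
    """Intersection over union for two flattened binary masks."""
    inter = sum(1 for p, r in zip(pred, ref) if p and r)
    union = sum(1 for p, r in zip(pred, ref) if p or r)
    return inter / union if union else 1.0  # two empty masks count as a match

def mean_iou_gain(baseline_masks, tta_masks, reference_masks):
    """Per-case IoU change (TTA minus baseline) and its mean over cases."""
    gains = [iou(t, r) - iou(b, r)
             for b, t, r in zip(baseline_masks, tta_masks, reference_masks)]
    return sum(gains) / len(gains), gains

# Invented toy cases: the first improves under TTA, the second gets worse.
reference_masks = [[1, 1, 0, 0], [0, 1, 1, 0]]
baseline_masks = [[1, 0, 0, 0], [0, 1, 1, 0]]
tta_masks = [[1, 1, 0, 0], [0, 1, 1, 1]]

mean_gain, gains = mean_iou_gain(baseline_masks, tta_masks, reference_masks)
print(f"mean IoU gain: {mean_gain:+.3f}, per case: {[round(g, 3) for g in gains]}")
```

Reporting the per-case gains alongside the mean makes it visible when an average improvement hides cases that the voting procedure degrades.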
Original abstract
Increasingly advanced data augmentation techniques have greatly aided clinical medical research, increasing data diversity and improving model generalization capabilities. Although most current basic models exhibit strong generalization abilities, image quality varies due to differences in equipment and operators. To address these challenges, we present SegTTA, a framework that improves medical image segmentation without model retraining by combining four augmentations (Gamma correction, Contrast enhancement, Gaussian blur, Gaussian noise) with weighted voting across multiple MedSAM2 checkpoints. Experiments demonstrate consistent improvements across three diverse datasets: healthy uterus segmentation, uterine myoma detection, and multi class hepatic structure segmentation. Ablation studies reveal that large organs benefit from intensity augmentations while small lesions require noise augmentations. The voting threshold controls the coverage precision trade off, enabling task specific optimization for different clinical requirements. Ultimately, on a multiclass hepatic vessel dataset, compared to MedSAM2 baselines, our method achieves an increase of 1.6 in mIoU and 1.9 in aIoU, along with a reduction of approximately 2.0 in HD95. Code will be available at https://github.com/AIGeeksGroup/SegTTA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SegTTA, a training-free test-time augmentation framework for zero-shot medical imaging segmentation. It applies four augmentations—Gamma correction, Contrast enhancement, Gaussian blur, and Gaussian noise—followed by weighted voting across multiple MedSAM2 checkpoints. The method is evaluated on three datasets: healthy uterus segmentation, uterine myoma detection, and multiclass hepatic structure segmentation, reporting consistent improvements such as +1.6 mIoU, +1.9 aIoU, and approximately -2.0 HD95 on the hepatic vessel dataset compared to MedSAM2 baselines. Ablation studies show that intensity augmentations benefit large organs while noise augmentations help small lesions, and the voting threshold allows for task-specific coverage-precision trade-offs.
Significance. If the reported gains can be achieved with a fixed, non-tuned configuration of augmentations and voting parameters, SegTTA would offer a practical, training-free way to enhance the performance of foundation models like MedSAM2 in clinical settings without additional data or retraining. The planned code release and the experiments across diverse datasets strengthen the potential impact. However, the emphasis on task-specific optimization of the threshold suggests that the improvements may depend on per-dataset adjustments, which could reduce the method's generalizability in truly zero-shot scenarios.
major comments (3)
- [Abstract] The central claim of consistent improvements in a training-free regime is qualified by the statement that 'the voting threshold controls the coverage precision trade off, enabling task specific optimization.' This raises a concern that the headline metrics (e.g., +1.6 mIoU on hepatic vessels) may result from dataset-specific tuning of weights and threshold rather than a single fixed setup, which would undermine the zero-shot test-time augmentation interpretation.
- [Abstract / Experiments] No details are provided on the exact values of augmentation parameters (e.g., gamma values, noise levels), the weighting scheme for voting, or how the threshold is selected. Without this information or evidence that a single set of parameters works across datasets, the reproducibility and generality of the gains cannot be assessed.
- [Ablation studies] The differentiation between intensity and noise augmentations based on organ/lesion size is interesting, but without quantitative results on how these choices affect the final metrics or controls for the number of augmentations, it is unclear if the four-augmentation combination is optimal or if simpler subsets would suffice.
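The subset comparison requested in the last comment is small enough to enumerate exhaustively. The sketch below lists every non-empty combination of the four augmentations grouped by size, so that ensembles with the same number of views can be compared like for like; the short string labels and the function name are hypothetical.

```python
from itertools import combinations

AUGMENTATIONS = ("gamma", "contrast", "blur", "noise")

def ablation_subsets(names=AUGMENTATIONS):
    """Yield every non-empty subset of augmentations, smallest subsets first."""
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            yield subset

subsets = list(ablation_subsets())
for subset in subsets:
    print(len(subset), subset)
```

Four augmentations give 15 non-empty subsets, so a full ablation per dataset is cheap relative to any retraining.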
minor comments (1)
- [Abstract] The abstract mentions 'approximately 2.0 in HD95' but does not specify the baseline value or units, which would aid interpretation.
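To see why the units matter, the sketch below computes a simplified HD95 with an explicit pixel spacing. It makes several assumptions that the paper may not share: distances are taken over all foreground pixels instead of extracted surfaces, the percentile uses the nearest-rank rule, and the two directed values are combined by taking their maximum. The function names are hypothetical.

```python
import math

def foreground_points(mask, spacing):
    """Physical coordinates of foreground pixels; spacing is (row, column) size."""
    sy, sx = spacing
    return [(y * sy, x * sx)
            for y, row in enumerate(mask) for x, v in enumerate(row) if v]

def directed_distances(src, dst):
    # Distance from each source point to its nearest destination point.
    return [min(math.dist(a, b) for b in dst) for a in src]

def percentile(values, q):
    # Nearest-rank percentile (one of several common conventions).
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def hd95(pred, ref, spacing=(1.0, 1.0)):
    """Symmetric 95th-percentile Hausdorff distance in the units of `spacing`."""
    p = foreground_points(pred, spacing)
    r = foreground_points(ref, spacing)
    if not p or not r:
        return math.inf  # undefined when either mask is empty
    return max(percentile(directed_distances(p, r), 95),
               percentile(directed_distances(r, p), 95))

pred = [[1, 0, 0, 0]]
ref = [[0, 0, 0, 1]]
in_pixels = hd95(pred, ref)                   # unit spacing
in_mm = hd95(pred, ref, spacing=(1.0, 0.5))   # 0.5 mm pixel width (hypothetical)
```

The same pair of masks yields 3.0 in pixel units and 1.5 with a 0.5 mm column spacing, so a reported reduction of about 2.0 cannot be interpreted without knowing which unit was used.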
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below with clarifications and commitments to revisions that strengthen the manuscript without altering its core claims.
- Referee: [Abstract] The central claim of consistent improvements in a training-free regime is qualified by the statement that 'the voting threshold controls the coverage precision trade off, enabling task specific optimization.' This raises a concern that the headline metrics (e.g., +1.6 mIoU on hepatic vessels) may result from dataset-specific tuning of weights and threshold rather than a single fixed setup, which would undermine the zero-shot test-time augmentation interpretation.
  Authors: We appreciate the referee's concern about potential ambiguity. The augmentation parameters and voting weights are fixed and identical for all experiments across the three datasets; only the optional threshold provides flexibility for coverage-precision trade-offs in different clinical contexts. The reported gains, including +1.6 mIoU and -2.0 HD95 on hepatic vessels, were obtained with a single fixed threshold value. We will revise the abstract to explicitly state that the primary results use a fixed, non-tuned configuration while noting the threshold as an optional control, thereby reinforcing the training-free and zero-shot nature of SegTTA.
  Revision: yes
- Referee: [Abstract / Experiments] No details are provided on the exact values of augmentation parameters (e.g., gamma values, noise levels), the weighting scheme for voting, or how the threshold is selected. Without this information or evidence that a single set of parameters works across datasets, the reproducibility and generality of the gains cannot be assessed.
  Authors: This is a fair observation on reproducibility. In the revised manuscript we will add a new subsection (or table) specifying the exact augmentation parameters (gamma value, contrast factor, blur sigma, noise variance), the uniform weighting scheme used for voting, and the default threshold selection (with sensitivity analysis). We will also explicitly confirm and demonstrate that this identical fixed parameter set was applied to the healthy uterus, uterine myoma, and multiclass hepatic vessel datasets, producing the reported consistent improvements.
  Revision: yes
- Referee: [Ablation studies] The differentiation between intensity and noise augmentations based on organ/lesion size is interesting, but without quantitative results on how these choices affect the final metrics or controls for the number of augmentations, it is unclear if the four-augmentation combination is optimal or if simpler subsets would suffice.
  Authors: We agree that the ablation section would benefit from additional quantitative detail. We will expand the ablation studies in the revision to report per-augmentation and combinatorial metric changes (mIoU, aIoU, HD95) on each dataset, while controlling for the number of augmentations by comparing the full four-augmentation ensemble against intensity-only, noise-only, and other subsets. This will provide concrete evidence supporting the observed differential benefits for large organs versus small lesions and the overall utility of the chosen combination.
  Revision: yes
Circularity Check
No significant circularity in empirical evaluation
The paper presents an empirical test-time augmentation framework (four fixed augmentations plus weighted voting on MedSAM2 checkpoints) and reports metric gains on three held-out medical imaging datasets against independent baselines. No equations, derivations, or self-referential steps appear in the provided text that reduce any claimed result to a fitted input or self-citation by construction. The note on task-specific threshold optimization is presented as a practical feature rather than evidence that headline numbers were obtained via circular fitting on test data. The evaluation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- Augmentation parameters
- Voting weights and threshold
axioms (1)
- domain assumption: Multiple MedSAM2 checkpoints produce sufficiently diverse and complementary predictions that weighted voting improves accuracy.
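This assumption can be probed before any voting is done by measuring how often the checkpoints disagree. The sketch below computes the mean pairwise disagreement rate over flattened binary masks; the metric choice, the function names, and the toy masks are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations

def disagreement(mask_a, mask_b):
    """Fraction of pixels on which two flattened binary masks differ."""
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)

def mean_pairwise_disagreement(masks):
    """Average disagreement over all pairs of checkpoint predictions."""
    pairs = list(combinations(masks, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)

# Invented predictions from three hypothetical checkpoints on one image.
checkpoint_masks = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
diversity = mean_pairwise_disagreement(checkpoint_masks)
print(f"mean pairwise disagreement: {diversity:.3f}")
```

A value near zero would mean the checkpoints make almost identical predictions, in which case voting across them can change little. A nonzero value shows diversity but does not by itself show that the errors are complementary.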