pith. machine review for the scientific record.

arxiv: 2604.11711 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords occlusion robustness · SAM models · medical image segmentation · endoscopy · polyp segmentation · benchmark · amodal segmentation · visible tissue

The pith

SAM-family models exhibit distinct occlusion behaviors, with some focusing on visible tissue and others inferring occluded anatomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces OccSAM-Bench to evaluate how foundation segmentation models perform when parts of the target, such as polyps, are hidden in endoscopy images by simulated surgical tools or cutout regions. Using two types of simulated occlusion at three severity levels on public datasets, it applies a new three-region metric that separately scores the full, visible, and invisible portions of the target. The results split the models into two groups: occluder-aware models, which stick to the visible tissue and avoid the instrument, and occluder-agnostic models, which predict into the hidden zones. One model, SAM-Med2D, fits neither group and performs worse overall. The upshot is that robustness to occlusion depends on which model is chosen for the task.
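A minimal sketch of the two synthetic occlusion types, under the assumption that severity is controlled by how much of the polyp the occluder hides; the function names, the hard-paste compositing, and the severity proxy below are illustrative, not the paper's released pipeline.

```python
# Illustrative sketch (not the authors' code): the two occlusion types
# described above, plus a severity proxy based on how much of the target
# the occluder covers.
import numpy as np

def overlay_tool(image, tool_rgb, tool_mask, top_left):
    """Paste a surgical-tool crop (with its binary mask) onto the frame."""
    occluded = image.copy()
    occluder = np.zeros(image.shape[:2], dtype=bool)
    h, w = tool_mask.shape
    y, x = top_left
    occluded[y:y + h, x:x + w][tool_mask] = tool_rgb[tool_mask]  # hard paste, no blending
    occluder[y:y + h, x:x + w] = tool_mask
    return occluded, occluder

def cutout(image, box):
    """Blank a rectangular region (Cutout-style occlusion)."""
    y0, x0, y1, x1 = box
    occluded = image.copy()
    occluded[y0:y1, x0:x1] = 0
    occluder = np.zeros(image.shape[:2], dtype=bool)
    occluder[y0:y1, x0:x1] = True
    return occluded, occluder

def occluded_fraction(polyp_mask, occluder):
    """Severity proxy: fraction of the target hidden by the occluder."""
    target = polyp_mask.astype(bool)
    return (target & occluder).sum() / max(target.sum(), 1)
```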

Core claim

The central discovery is that occlusion robustness varies across SAM-family architectures. Occluder-Aware models such as SAM, SAM 2, SAM 3, and MedSAM3 prioritize delineating only the visible tissue and reject instruments, while Occluder-Agnostic models like MedSAM and MedSAM2 confidently segment into occluded regions. SAM-Med2D underperforms in all tested conditions. The three-region protocol exposes these differences that standard evaluation misses.

What carries the argument

A three-region evaluation protocol that breaks down segmentation performance into full target, visible-only, and invisible target regions to reveal model behaviors under occlusion.
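As a rough sketch of what such a decomposition computes, assuming the ground-truth target mask and the occluder mask are both available as binary arrays; restricting the prediction to each region is our reading of the protocol, not the authors' reference implementation.

```python
# Sketch of a three-region decomposition: score the same prediction against
# the full target, the visible part (target outside the occluder), and the
# invisible part (target under the occluder).
import numpy as np

def dice(pred, gt, eps=1e-7):
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (2 * (pred & gt).sum() + eps) / (pred.sum() + gt.sum() + eps)

def three_region_dsc(pred, target, occluder):
    pred, target, occluder = (m.astype(bool) for m in (pred, target, occluder))
    visible = target & ~occluder      # tissue the model can actually see
    invisible = target & occluder     # tissue hidden behind the occluder
    return {
        "full": dice(pred, target),
        "visible": dice(pred & ~occluder, visible),
        "invisible": dice(pred & occluder, invisible),
    }
```

Two models can post the same full-mask score while one earns it on the visible region and the other by predicting into the invisible one, which is exactly the distinction the standard metric hides.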

Load-bearing premise

Synthesized occlusions from surgical tool overlays and cutouts on polyp datasets accurately reflect the conditions and effects of real-world occlusions in clinical endoscopy procedures.

What would settle it

If evaluation on actual endoscopic videos with natural occlusions fails to reproduce the same division into occluder-aware and occluder-agnostic model groups, the benchmark's conclusions about model archetypes would not hold.

Figures

Figures reproduced from arXiv: 2604.11711 by Luu Le, Nhan Ho, Thanh-Huy Nguyen, Thien Nguyen, Ulas Bagci, Xiaofeng Liu.

Figure 1
Figure 1. OccSAM-Bench: evaluating segmentation under surgical occlusion. Clinical endoscopy often involves partial occlusion of target anatomy by surgical instruments. OccSAM-Bench simulates such conditions by overlaying surgical tools onto polyp images to create controlled occlusions. Models are evaluated using a three-region protocol that separates visible, invisible (occluded), and full masks, revealing behavi… view at source ↗
Figure 2
Figure 2. OccSAM-Bench overview. Left: Surgical-tool occlusions are simulated by pasting instrument masks onto polyp images from three colonoscopy datasets. Right: SAM-family models exhibit two behaviors under occlusion: Occluder-Aware models segment visible tissue while rejecting the instrument, whereas Occluder-Agnostic models predict through the occluder. Both can achieve similar full-mask scores, showing that st… view at source ↗
Figure 3
Figure 3. Visible-only DSC comparison between Cutout and Surgical Tool occlusions across 3 datasets under increasing severity (Clean, …) view at source ↗
Figure 4
Figure 4. Visible-mask DSC comparison between Box and Single-Point prompts under increasing occlusion levels (Clean, Low, Medium, …) view at source ↗
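Figure 4 compares box and single-point prompting; this page does not say how the prompts are derived, so the sketch below uses a common convention (assumed, not confirmed by the paper): the box is the tight bounding box of the ground-truth polyp mask, and the point is a foreground pixel near its centroid.

```python
# Assumed prompt construction for SAM-style models: a tight box and a single
# positive click, both derived from the ground-truth mask.
import numpy as np

def box_prompt(mask):
    """(x0, y0, x1, y1) bounding box of the ground-truth mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def point_prompt(mask):
    """Single positive click: the mask pixel closest to the centroid."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    i = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
    return int(xs[i]), int(ys[i])   # (x, y) in image coordinates
```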
read the original abstract

Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation models in clinical endoscopy. We introduce OccSAM-Bench, a benchmark designed to systematically evaluate SAM-family models under controlled, synthesized surgical occlusion. Our framework simulates two occlusion types (i.e., surgical tool overlay and cutout) across three calibrated severity levels on three public polyp datasets. We propose a novel three-region evaluation protocol that decomposes segmentation performance into full, visible-only, and invisible targets. This metric exposes behaviors that standard amodal evaluation obscures, revealing two distinct model archetypes: Occluder-Aware models (SAM, SAM 2, SAM 3, MedSAM3), which prioritize visible tissue delineation and reject instruments, and Occluder-Agnostic models (MedSAM, MedSAM2), which confidently predict into occluded regions. SAM-Med2D aligns with neither and underperforms across all conditions. Ultimately, our results demonstrate that occlusion robustness is not uniform across architectures, and model selection must be driven by specific clinical intent, whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy.
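One way to make the archetype distinction concrete is to read it off the three-region scores; the rule and thresholds below are an illustrative paraphrase of the abstract, not the paper's formal criterion.

```python
# Illustrative (not the authors') rule for labeling a model's behavior from
# its three-region Dice scores; `instrument_overlap` is the fraction of the
# prediction that lands on the occluder mask.
def archetype(scores, instrument_overlap, lo=0.2, hi=0.5):
    if scores["visible"] >= hi and scores["invisible"] <= lo and instrument_overlap <= lo:
        return "occluder-aware"      # delineates visible tissue, rejects the tool
    if scores["invisible"] >= hi:
        return "occluder-agnostic"   # confidently predicts into the occluded region
    return "neither"                 # e.g., the SAM-Med2D outlier described above
```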

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces OccSAM-Bench, a controlled benchmark evaluating SAM-family models (including SAM, SAM 2, SAM 3, MedSAM variants, and SAM-Med2D) on polyp segmentation under synthesized surgical occlusions. It applies two occlusion types (tool overlay and cutout) at three calibrated severity levels to public polyp datasets and proposes a three-region protocol decomposing performance into full, visible-only, and invisible targets. This reveals two archetypes—Occluder-Aware models that prioritize visible tissue and reject instruments versus Occluder-Agnostic models that confidently inpaint occluded regions—with SAM-Med2D as an outlier; the work concludes that occlusion robustness is non-uniform and model choice should follow clinical intent for conservative visible versus amodal hidden anatomy segmentation.

Significance. The three-region protocol is a clear strength, as it systematically exposes model behaviors that standard amodal metrics would obscure, providing a reproducible way to characterize foundation model responses to occlusion in medical imaging. If the synthetic conditions prove representative, the archetype distinction offers practical guidance for selecting models in endoscopy. However, the clinical implications are only partially grounded without evidence that the controlled synthesis captures real occlusion physics.

major comments (1)
  1. [Abstract] The claim that 'model selection must be driven by specific clinical intent—whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy' is load-bearing for the paper's contribution but depends on the synthesized occlusions (tool overlay and cutout at three severity levels) accurately representing real surgical conditions; the evaluation protocol does not include validation against actual endoscopic frames with shadows, specular reflections, tissue deformation, or dynamic motion, leaving open the possibility that the observed archetype split is an artifact of the synthesis method rather than a transferable architectural property.
minor comments (2)
  1. The abstract and methods should explicitly list all model variants with precise version citations (e.g., SAM 2 vs. SAM 3) and dataset splits to improve reproducibility.
  2. Figure captions for the three-region protocol would benefit from an explicit diagram showing how visible-only and invisible targets are defined on the same frame.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for recognizing the value of the three-region protocol and for the constructive critique. We respond to the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'model selection must be driven by specific clinical intent—whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy' is load-bearing for the paper's contribution but depends on the synthesized occlusions (tool overlay and cutout at three severity levels) accurately representing real surgical conditions; the evaluation protocol does not include validation against actual endoscopic frames with shadows, specular reflections, tissue deformation, or dynamic motion, leaving open the possibility that the observed archetype split is an artifact of the synthesis method rather than a transferable architectural property.

    Authors: We appreciate the referee highlighting the dependence of our clinical recommendation on the fidelity of the synthetic occlusions. The controlled synthesis (calibrated tool overlays and cutouts at three severity levels) was intentionally chosen to isolate occlusion effects and enable reproducible severity scaling, which is difficult to achieve with the high variability of real endoscopic videos. The consistent emergence of the occluder-aware versus occluder-agnostic archetypes across SAM-family variants and multiple polyp datasets supports our view that these behaviors reflect architectural differences rather than synthesis artifacts. That said, we agree that the lack of direct validation against real frames containing shadows, specularities, deformation, and motion means the transferability of the archetype distinction to clinical practice is not yet fully demonstrated. In the revised manuscript we will (i) qualify the abstract claim to state that the guidance applies under controlled synthetic conditions and (ii) add an explicit limitations paragraph in the Discussion that acknowledges this gap and outlines future validation on real surgical videos. These changes constitute a partial revision that preserves the benchmark contribution while addressing the grounding concern.
    revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct observations

full rationale

The paper introduces OccSAM-Bench as a controlled evaluation framework on public datasets, applies synthesized occlusions at fixed severity levels, and reports performance via a three-region protocol. No equations, fitted parameters, predictions, or derivations are present; the central distinction between Occluder-Aware and Occluder-Agnostic archetypes emerges directly from tabulated results on the chosen models and metrics. The evaluation protocol and occlusion synthesis are defined independently of the final claims, with no load-bearing self-citations and no renaming of prior results. This is a standard empirical benchmark study whose findings are falsifiable against the released benchmark data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the artificial occlusions faithfully model clinical conditions and that the three-region decomposition reveals clinically relevant behaviors.

axioms (1)
  • domain assumption Synthesized tool overlay and cutout occlusions at calibrated severity levels accurately simulate real surgical occlusion in endoscopy
    Invoked when claiming the benchmark evaluates real-world robustness

pith-pipeline@v0.9.0 · 5525 in / 1236 out tokens · 49687 ms · 2026-05-10T16:20:40.043392+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    SAM 3: Segment anything with concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll-Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han W...

  2. [2]

    Christian Chang, Hudson Law, Connor Poon, Sydney Yen, Kaustubh Lall, Armin Jamshidi, Vadim Malis, Dosik Hwang, and Won C. Bae. Segment anything model (SAM) and medical SAM (MedSAM) for lumbar spine MRI. Sensors, 25(12):3596, 2025.

  3. [3]

    SAM-Med2D

    Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, et al. SAM-Med2D. arXiv preprint arXiv:2308.16184, 2023.

  4. [4]

    Improved Regularization of Convolutional Neural Networks with Cutout

    Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

  5. [5]

    Stable segment anything model

    Qi Fan, Xin Tao, Lei Ke, Mingqiao Ye, Di Zhang, Pengfei Wan, Yu-Wing Tai, and Chi-Keung Tang. Stable segment anything model. In International Conference on Learning Representations (ICLR), 2025.

  6. [6]

    Learning to see the invisible: End-to-end trainable amodal instance segmentation

    Patrick Follmann, Rebecca König, Philipp Härtinger, Michael Klostermann, and Tobias Böttger. Learning to see the invisible: End-to-end trainable amodal instance segmentation. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1328–1336. IEEE, 2019.

  7. [7]

    Medsam: Segment anything in medical images,

    Sheng He, Rina Bao, Jingpeng Li, Jeffrey Stout, Atle Bjornerud, P. Ellen Grant, and Yangming Ou. Computer-vision benchmark segment-anything model (SAM) in medical images: Accuracy in 12 datasets. arXiv preprint arXiv:2304.09324, 2023.

  8. [8]

    Segment anything model for medical images?

    Yuhao Huang, Xin Yang, Lian Liu, Han Zhou, Ao Chang, Xinrui Zhou, Rusi Chen, Junxuan Yu, Jiongquan Chen, Chaoyu Chen, et al. Segment anything model for medical images? Medical Image Analysis, 92:103061, 2024.

  9. [9]

    Kvasir-instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy

    Debesh Jha, Sharib Ali, Krister Emanuelsen, Steven A Hicks, Vajira Thambawita, Enrique Garcia-Ceja, Michael A Riegler, Thomas De Lange, Peter T Schmidt, Håvard D Johansen, et al. Kvasir-instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy. In International Conference on Multimedia Modeling, pages 218–229. Spr...

  10. [10]

    Deep occlusion-aware instance segmentation with overlapping BiLayers

    Lei Ke, Yu-Wing Tai, and Chi-Keung Tang. Deep occlusion-aware instance segmentation with overlapping BiLayers. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4184–4193, 2021.

  11. [11]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

  12. [12]

    Ultra-ecp: Ellipse-constrained and point-robust foundation model adaptation for fetal cardiac ultrasound segmentation

    Minh HN Le, Khanh TQ Le, Tuan Vinh, Thanh-Huy Nguyen, Han H Huynh, Khoa D Pham, Anh Mai Vu, Hien Quang Kha, Phat Ky Nguyen, Ulas Bagci, et al. Ultra-ecp: Ellipse-constrained and point-robust foundation model adaptation for fetal cardiac ultrasound segmentation. In Medical Imaging with Deep Learning, 2026.

  13. [13]

    Medsam3: Delving into segment anything with medical concepts

    Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, and Jintai Chen. Medsam3: Delving into segment anything with medical concepts, 2025.

  14. [14]

    Segment anything in medical images

    Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1):654, 2024.

  15. [15]

    Segment anything model for medical image analysis: an experimental study

    Maciej A Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz, and Yixin Zhang. Segment anything model for medical image analysis: an experimental study. Medical Image Analysis, 89:102918, 2023.

  16. [16]

    Up2d: Uncertainty-aware progressive pseudo-label denoising for source-free domain adaptive medical image segmentation

    Thanh-Huy Nguyen, Quang-Khai Bui-Tran, Manh D Ho, Thinh B Lam, Vi Vu, Hoang-Thien Nguyen, Phat Huynh, and Ulas Bagci. Up2d: Uncertainty-aware progressive pseudo-label denoising for source-free domain adaptive medical image segmentation. Neurocomputing, page 132659, 2026.

  17. [17]

    Adaptive knowledge transferring with switching dual-student framework for semi-supervised medical image segmentation

    Thanh-Huy Nguyen, Hoang-Thien Nguyen, Ba-Thinh Lam, Vi Vu, Bach X Nguyen, Jianhua Xing, Tianyang Wang, Xingjian Li, and Min Xu. Adaptive knowledge transferring with switching dual-student framework for semi-supervised medical image segmentation. Pattern Recognition, page 113115, 2026.

  18. [18]

    Amodal instance segmentation with KINS dataset

    Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with KINS dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3014–3023, 2019.

  19. [19]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

  20. [20]

    Comparative validation of multi-instance instrument segmentation in endoscopy: Results of the ROBUST-MIS 2019 challenge

    Tobias Roß, Annika Reinke, Peter M Full, Martin Wagner, Hannes Kenngott, Martin Apitz, Hellena Hempe, Diana Mindroc-Filimon, Patrick Scholz, Thuy Nuong Tran, et al. Comparative validation of multi-instance instrument segmentation in endoscopy: Results of the ROBUST-MIS 2019 challenge. Medical Image Analysis, 70:101920, 2021.

  21. [21]

    Tobias Rueckert, Maximilian Rieder, David Rauber, Michel Xiao, Eg Humolli, Hubertus Feussner, Dirk Wilhelm, and Christoph Palm. Augmenting instrument segmentation in video sequences of minimally invasive surgery by synthetic smoky frames. International Journal of Computer Assisted Radiology and Surgery, pages S54–S56, 2023.

  22. [22]

    Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer

    Juan Silva, Aymeric Histace, Olivier Romain, Xavier Dray, and Bertrand Granado. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery, 9(2):283–293, 2014.

  23. [23]

    Segment anything, even occluded

    Wei-En Tai, Yu-Lin Shih, Cheng Sun, Yu-Chiang Frank Wang, and Hwann-Tzong Chen. Segment anything, even occluded. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29385–29394, 2025.

  24. [24]

    Automated polyp detection in colonoscopy videos using shape and context information

    Nima Tajbakhsh, Suryakanth R. Gurudu, and Jianming Liang. Automated polyp detection in colonoscopy videos using shape and context information. IEEE Transactions on Medical Imaging, 35(2):630–644, 2016.

  25. [25]

    Unsupervised multi-scale segmentation of cellular cryo-electron tomograms with stable diffusion foundation model

    Mostofa Rafid Uddin, Thanh-Huy Nguyen, HM Shadman Tabib, Kashish Gandhi, and Min Xu. Unsupervised multi-scale segmentation of cellular cryo-electron tomograms with stable diffusion foundation model. bioRxiv, pages 2025–06.

  26. [26]

    A benchmark for endoluminal scene segmentation of colonoscopy images

    David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, and Aaron Courville. A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of Healthcare Engineering, 2017(1):4037190, 2017.

  27. [27]

    From specialist to generalist: Unlocking SAM's learning potential on unlabeled medical images

    Vi Vu, Thanh-Huy Nguyen, Tien-Thinh Nguyen, Ba-Thinh Lam, Hoang-Thien Nguyen, Tianyang Wang, Xingjian Li, and Min Xu. From specialist to generalist: Unlocking SAM's learning potential on unlabeled medical images. arXiv preprint arXiv:2601.17934, 2026.

  28. [28]

    Describe anything in medical images

    Xi Xiao, Yunbei Zhang, Thanh-Huy Nguyen, Ba-Thinh Lam, Janet Wang, Lin Zhao, Jihun Hamm, Tianyang Wang, Xingjian Li, Xiao Wang, et al. Describe anything in medical images. arXiv preprint arXiv:2505.05804, 2025.

  29. [29]

    Medical SAM 2: Segment medical images as video via segment anything model 2

    Jiayuan Zhu, Abdullah Hamdi, Yunli Qi, Yueming Jin, and Junde Wu. MedSAM2: Segment medical images as video via segment anything model 2. arXiv preprint arXiv:2408.00874, 2024.

  30. [30]

    Semantic amodal segmentation

    Yan Zhu, Yuandong Tian, Dimitris Metaxas, and Piotr Dollár. Semantic amodal segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1464–1472, 2017.