Polyp Detection and Segmentation using Mask R-CNN: Does a Deeper Feature Extractor CNN Always Perform Better?
Pith reviewed 2026-05-24 18:26 UTC · model grok-4.3
The pith
Mask R-CNN with added polyp images and an ensemble reaches state-of-the-art segmentation on the 2015 MICCAI dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mask R-CNN is adapted with different CNN feature extractors; adding extra polyp images improves results for each extractor; an ensemble method yields further gains that reach state-of-the-art segmentation performance on the 2015 MICCAI polyp detection dataset.
What carries the argument
Mask R-CNN with interchangeable CNN backbones plus an ensemble of model outputs for final detection and segmentation masks.
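The summary here does not say how the paper combines model outputs, so the following is only a minimal sketch of one common ensembling rule: average the per-pixel polyp probabilities from several models, then threshold. The function name `ensemble_masks`, the 0.5 threshold, and the assumption that each model's output has already been reduced to a single per-pixel probability map are all illustrative choices, not the authors' method.

```python
def ensemble_masks(prob_maps, threshold=0.5):
    """Average per-pixel probabilities across models, then threshold.

    prob_maps: list of 2-D lists (one per model), all the same shape,
    holding polyp probabilities in [0, 1]. Hypothetical input format.
    """
    if not prob_maps:
        raise ValueError("need at least one probability map")
    n_models = len(prob_maps)
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    mask = []
    for r in range(height):
        row = []
        for c in range(width):
            mean_prob = sum(m[r][c] for m in prob_maps) / n_models
            row.append(1 if mean_prob >= threshold else 0)
        mask.append(row)
    return mask


# Toy 2x2 outputs standing in for three backbones (made-up numbers).
model_a = [[0.9, 0.2], [0.6, 0.1]]
model_b = [[0.8, 0.7], [0.3, 0.0]]
model_c = [[0.7, 0.9], [0.2, 0.2]]
combined = ensemble_masks([model_a, model_b, model_c])
print(combined)
```

Averaging lets a pixel that one model misses still be recovered when the other models are confident, which is one plausible reason an ensemble could beat any single backbone.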
If this is right
- Adding more polyp images to the training set raises recall, precision, Dice, and Jaccard for every tested feature extractor.
- An ensemble of Mask R-CNN outputs outperforms any single model on the same dataset.
- Deeper feature extractors do not automatically deliver the highest scores once extra training images are available.
- The reported metrics set a new state-of-the-art segmentation benchmark on the 2015 MICCAI polyp detection task.
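For readers unfamiliar with the four metrics named above, the sketch below computes them on toy inputs using their standard definitions. The function names and the convention that two empty masks score 1.0 are assumptions made for illustration; the paper's exact evaluation code is not shown on this page.

```python
def mask_overlap_scores(pred, truth):
    """Dice and Jaccard for two binary masks given as 2-D lists."""
    pred_flat = [bool(v) for row in pred for v in row]
    truth_flat = [bool(v) for row in truth for v in row]
    if len(pred_flat) != len(truth_flat):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred_flat, truth_flat))
    union = sum(p or t for p, t in zip(pred_flat, truth_flat))
    total = sum(pred_flat) + sum(truth_flat)
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2.0 * intersection / total if total else 1.0
    jaccard = intersection / union if union else 1.0
    return dice, jaccard


def detection_scores(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy 3x3 masks: the prediction covers the true region plus one extra pixel.
pred = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
truth = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
dice, jaccard = mask_overlap_scores(pred, truth)
print(dice, jaccard)
```

For a single pair of masks the two overlap scores are tied by Jaccard = Dice / (2 - Dice), so Jaccard is never higher than Dice; averages taken over many images need not satisfy that identity exactly.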
Where Pith is reading between the lines
- If the MICCAI 2015 images differ systematically from current clinical practice, the reported numbers may overestimate real-world performance.
- Similar medical imaging tasks with high visual variation may benefit more from targeted data collection than from ever-deeper backbones.
- Cross-site validation on fresh colonoscopy cohorts would directly test whether the ensemble advantage persists outside the original benchmark.
Load-bearing premise
The 2015 MICCAI polyp detection dataset and its evaluation protocol are representative enough of real clinical colonoscopy images that the reported metrics will generalize.
What would settle it
Running the best ensemble model on an independent collection of colonoscopy videos from new patients or different endoscopes and obtaining Dice scores below 60 percent, against the 70.42 percent reported on the benchmark.
Original abstract
Automatic polyp detection and segmentation are highly desirable for colon screening due to polyp miss rate by physicians during colonoscopy, which is about 25%. However, this computerization is still an unsolved problem due to various polyp-like structures in the colon and high interclass polyp variations in terms of size, color, shape, and texture. In this paper, we adapt Mask R-CNN and evaluate its performance with different modern convolutional neural networks (CNN) as its feature extractor for polyp detection and segmentation. We investigate the performance improvement of each feature extractor by adding extra polyp images to the training dataset to answer whether we need deeper and more complex CNNs or better dataset for training in automatic polyp detection and segmentation. Finally, we propose an ensemble method for further performance improvement. We evaluate the performance on the 2015 MICCAI polyp detection dataset. The best results achieved are 72.59% recall, 80% precision, 70.42% dice, and 61.24% Jaccard. The model achieved state-of-the-art segmentation performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts Mask R-CNN for simultaneous polyp detection and segmentation, systematically tests multiple modern CNN backbones as feature extractors, measures the effect of augmenting the training set with additional polyp images, introduces an ensemble of the resulting models, and evaluates everything on the 2015 MICCAI polyp detection dataset. It reports peak figures of 72.59% recall, 80% precision, 70.42% Dice, and 61.24% Jaccard and asserts that these constitute state-of-the-art segmentation performance.
Significance. Should the numerical superiority be confirmed on the official 2015 MICCAI held-out test split with fair, protocol-matched baselines, the study would supply concrete empirical guidance on the relative value of deeper backbones versus additional training data for medical segmentation tasks.
major comments (2)
- [Abstract / Results] Abstract and results section: the manuscript asserts state-of-the-art segmentation performance (70.42% Dice, 61.24% Jaccard) yet provides neither a comparison table nor explicit numerical citations demonstrating that these figures exceed all prior methods evaluated on the identical official 2015 MICCAI test split; without this evidence the central SOTA claim cannot be verified.
- [Evaluation] Evaluation protocol description: the text does not state whether the extra polyp images added to training intersect the official test set, nor does it document the precise train/validation/test partitioning or the re-implementation details of the cited baseline methods; these omissions directly affect the validity of the reported superiority.
minor comments (1)
- [Abstract / Conclusion] The title poses the question whether deeper feature extractors always perform better, but the abstract and conclusion do not explicitly state the paper’s answer to that question.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.
- Referee: [Abstract / Results] Abstract and results section: the manuscript asserts state-of-the-art segmentation performance (70.42% Dice, 61.24% Jaccard) yet provides neither a comparison table nor explicit numerical citations demonstrating that these figures exceed all prior methods evaluated on the identical official 2015 MICCAI test split; without this evidence the central SOTA claim cannot be verified.
  Authors: We agree that the SOTA claim requires explicit supporting evidence in the form of a comparison. In the revised manuscript we will add a dedicated table in the results section that lists Dice and Jaccard scores from prior methods evaluated on the official 2015 MICCAI held-out test split, together with the corresponding citations. This will allow direct verification of the reported figures.
  Revision: yes
- Referee: [Evaluation] Evaluation protocol description: the text does not state whether the extra polyp images added to training intersect the official test set, nor does it document the precise train/validation/test partitioning or the re-implementation details of the cited baseline methods; these omissions directly affect the validity of the reported superiority.
  Authors: We acknowledge these omissions in the current text. The revised manuscript will explicitly state that the additional polyp images were drawn from sources with no overlap to the official test set, will document the exact train/validation/test partitioning used, and will provide the re-implementation details (hyperparameters, training schedule, and code-level adaptations) for all cited baseline methods to ensure the comparisons are reproducible and protocol-matched.
  Revision: yes
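The no-overlap statement promised in the second response is something a reader could check mechanically. The sketch below is one simple way to do that, assuming the image bytes are available in memory under hypothetical file names. It flags only byte-identical files, so near-duplicate video frames or re-encoded images would need a perceptual comparison that is not shown here.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """SHA-256 hex digest of raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def find_exact_overlap(extra_images, test_images):
    """Return sorted (extra_name, test_name) pairs with identical content.

    Both arguments map a file name to its raw bytes (hypothetical layout).
    """
    test_by_digest = {}
    for name, data in test_images.items():
        test_by_digest.setdefault(content_digest(data), []).append(name)
    overlaps = []
    for name, data in extra_images.items():
        for test_name in test_by_digest.get(content_digest(data), []):
            overlaps.append((name, test_name))
    return sorted(overlaps)


# Made-up byte strings standing in for image files.
extra = {"extra_001.png": b"frame-A", "extra_002.png": b"frame-B"}
test = {"test_017.png": b"frame-B", "test_018.png": b"frame-C"}
leaks = find_exact_overlap(extra, test)
print(leaks)
```

An empty result rules out exact duplicates only; it does not by itself establish that the extra images come from different patients or procedures than the test set.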
Circularity Check
No circularity: purely empirical results on a public benchmark, with no derivations.
The paper reports direct experimental outcomes from training Mask R-CNN variants with different backbones on the 2015 MICCAI polyp dataset, with and without added training images, followed by an ensemble. All reported numbers (72.59% recall, 80% precision, 70.42% Dice, 61.24% Jaccard) are measured performance on the held-out test split; no equations, first-principles predictions, or fitted parameters are redefined as outputs. The SOTA segmentation claim rests on numerical comparison to prior external methods under the same public protocol, which is falsifiable outside the paper. No self-citation chain or ansatz is invoked to justify any central result. This is a standard empirical ML evaluation with no load-bearing circular steps.