pith. sign in

arxiv: 1907.09180 · v1 · pith:KUAK27E4new · submitted 2019-07-22 · 💻 cs.CV · cs.LG· eess.IV

Polyp Detection and Segmentation using Mask R-CNN: Does a Deeper Feature Extractor CNN Always Perform Better?

Pith reviewed 2026-05-24 18:26 UTC · model grok-4.3

classification 💻 cs.CV cs.LGeess.IV
keywords polyp detectionpolyp segmentationMask R-CNNcolonoscopyfeature extractorensembleMICCAI 2015 datasetdeep learning
0
0 comments X

The pith

Mask R-CNN with added polyp images and an ensemble reaches state-of-the-art segmentation on the 2015 MICCAI dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts Mask R-CNN for polyp detection and segmentation to address the roughly 25 percent miss rate during colonoscopy. It tests multiple modern CNNs as feature extractors and measures how much performance improves when extra polyp images are added to training. The results indicate that dataset expansion helps each backbone and that ensembling the outputs produces the top scores of 72.59 percent recall, 80 percent precision, 70.42 percent dice, and 61.24 percent Jaccard. These figures establish new state-of-the-art segmentation performance on the benchmark.

Core claim

Mask R-CNN is adapted with different CNN feature extractors; adding extra polyp images improves results for each extractor; an ensemble method yields further gains that reach state-of-the-art segmentation performance on the 2015 MICCAI polyp detection dataset.

What carries the argument

Mask R-CNN with interchangeable CNN backbones plus an ensemble of model outputs for final detection and segmentation masks.

If this is right

  • Adding more polyp images to the training set raises recall, precision, dice, and Jaccard for every tested feature extractor.
  • An ensemble of Mask R-CNN outputs outperforms any single model on the same dataset.
  • Deeper feature extractors do not automatically deliver the highest scores once extra training images are available.
  • The reported metrics set a new state-of-the-art segmentation benchmark on the 2015 MICCAI polyp detection task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the MICCAI 2015 images differ systematically from current clinical practice, the reported numbers may overestimate real-world performance.
  • Similar medical imaging tasks with high visual variation may benefit more from targeted data collection than from ever-deeper backbones.
  • Cross-site validation on fresh colonoscopy cohorts would directly test whether the ensemble advantage persists outside the original benchmark.

Load-bearing premise

The 2015 MICCAI polyp detection dataset and its evaluation protocol are representative enough of real clinical colonoscopy images that the reported metrics will generalize.

What would settle it

Running the best ensemble model on an independent collection of colonoscopy videos from new patients or different endoscopes and obtaining dice scores below 60 percent.

Figures

Figures reproduced from arXiv: 1907.09180 by Hemin Ali Qadir, Ilangko Balasingham, Jacob Bergsland, Johannes Solhusvik, Lars Aabakken, Younghak Shin.

Figure 1
Figure 1. Figure 1: Our Mask R-CNN framework. In the first stage, we use Resnet50, Resnet101 and Resnet Inception v2 as the feature extractor for the performance [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy of the CNN feature extractors vs. number of epochs [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example of three outputs produced by our Mask R-CNN models. The [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Example of two outputs produced by the three Mask R-CNN models. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example of three outputs produced by Mask R-CNN with Inception [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
read the original abstract

Automatic polyp detection and segmentation are highly desirable for colon screening due to polyp miss rate by physicians during colonoscopy, which is about 25%. However, this computerization is still an unsolved problem due to various polyp-like structures in the colon and high interclass polyp variations in terms of size, color, shape, and texture. In this paper, we adapt Mask R-CNN and evaluate its performance with different modern convolutional neural networks (CNN) as its feature extractor for polyp detection and segmentation. We investigate the performance improvement of each feature extractor by adding extra polyp images to the training dataset to answer whether we need deeper and more complex CNNs or better dataset for training in automatic polyp detection and segmentation. Finally, we propose an ensemble method for further performance improvement. We evaluate the performance on the 2015 MICCAI polyp detection dataset. The best results achieved are 72.59% recall, 80% precision, 70.42% dice, and 61.24% Jaccard. The model achieved state-of-the-art segmentation performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper adapts Mask R-CNN for simultaneous polyp detection and segmentation, systematically tests multiple modern CNN backbones as feature extractors, measures the effect of augmenting the training set with additional polyp images, introduces an ensemble of the resulting models, and evaluates everything on the 2015 MICCAI polyp detection dataset. It reports peak figures of 72.59 % recall, 80 % precision, 70.42 % Dice and 61.24 % Jaccard and asserts that these constitute state-of-the-art segmentation performance.

Significance. Should the numerical superiority be confirmed on the official 2015 MICCAI held-out test split with fair, protocol-matched baselines, the study would supply concrete empirical guidance on the relative value of deeper backbones versus additional training data for medical segmentation tasks.

major comments (2)
  1. [Abstract / Results] Abstract and results section: the manuscript asserts state-of-the-art segmentation performance (70.42 % Dice, 61.24 % Jaccard) yet provides neither a comparison table nor explicit numerical citations demonstrating that these figures exceed all prior methods evaluated on the identical official 2015 MICCAI test split; without this evidence the central SOTA claim cannot be verified.
  2. [Evaluation] Evaluation protocol description: the text does not state whether the extra polyp images added to training intersect the official test set, nor does it document the precise train/validation/test partitioning or the re-implementation details of the cited baseline methods; these omissions directly affect the validity of the reported superiority.
minor comments (1)
  1. [Abstract / Conclusion] The title poses the question whether deeper feature extractors always perform better, but the abstract and conclusion do not explicitly state the paper’s answer to that question.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and results section: the manuscript asserts state-of-the-art segmentation performance (70.42 % Dice, 61.24 % Jaccard) yet provides neither a comparison table nor explicit numerical citations demonstrating that these figures exceed all prior methods evaluated on the identical official 2015 MICCAI test split; without this evidence the central SOTA claim cannot be verified.

    Authors: We agree that the SOTA claim requires explicit supporting evidence in the form of a comparison. In the revised manuscript we will add a dedicated table in the results section that lists Dice and Jaccard scores from prior methods evaluated on the official 2015 MICCAI held-out test split, together with the corresponding citations. This will allow direct verification of the reported figures. revision: yes

  2. Referee: [Evaluation] Evaluation protocol description: the text does not state whether the extra polyp images added to training intersect the official test set, nor does it document the precise train/validation/test partitioning or the re-implementation details of the cited baseline methods; these omissions directly affect the validity of the reported superiority.

    Authors: We acknowledge these omissions in the current text. The revised manuscript will explicitly state that the additional polyp images were drawn from sources with no overlap to the official test set, will document the exact train/validation/test partitioning used, and will provide the re-implementation details (hyperparameters, training schedule, and code-level adaptations) for all cited baseline methods to ensure the comparisons are reproducible and protocol-matched. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical results on public benchmark with no derivations

full rationale

The paper reports direct experimental outcomes from training Mask R-CNN variants with different backbones on the 2015 MICCAI polyp dataset, with and without added training images, followed by an ensemble. All reported numbers (72.59% recall, 80% precision, 70.42% Dice, 61.24% Jaccard) are measured performance on the held-out test split; no equations, first-principles predictions, or fitted parameters are redefined as outputs. The SOTA segmentation claim rests on numerical comparison to prior external methods under the same public protocol, which is falsifiable outside the paper. No self-citation chain or ansatz is invoked to justify any central result. This is a standard empirical ML evaluation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5747 in / 1230 out tokens · 21392 ms · 2026-05-24T18:26:42.785970+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1]

    Cancer statistics, 2018,

    R. L. Siegel, K. D. Miller, and A. Jemal. “Cancer statistics, 2018,” American Cancer Society, 68(1):730, 2018

  2. [2]

    High-grade dysplasia and invasive carcinoma in colorectal adenomas: a multivariate analysis of the impact of adenoma and patient characteristics,

    M. Gschwantler, S. Kriwanek, E. Langner, B. Goritzer, C. Schrutka- Kolbl, E. Brownstone, H. Feichtinger, and W. Weiss. “High-grade dysplasia and invasive carcinoma in colorectal adenomas: a multivariate analysis of the impact of adenoma and patient characteristics,” European journal of gastroenterology and hepatology, 14(2):183188, 2002

  3. [3]

    Global patterns and trends in colorectal cancer incidence and mortality,

    M. Arnold, M. S. Sierra, M. Laversanne, I. Soerjomataram, A. Jemal,and F. Bray. “Global patterns and trends in colorectal cancer incidence and mortality,” Gut, pages gutjnl2015, 2016

  4. [4]

    Factors influencing the miss rate of polyps in a back-to-back colonoscopy study,

    A. M. Leufkens, M. G. H. van Oijen, F. P. Vleggaar, and P. D. Siersema. “Factors influencing the miss rate of polyps in a back-to-back colonoscopy study,” Endoscopy, 44(05):470475, 2012

  5. [5]

    Computer-aided tumor detection in endoscopic video using color wavelet features,

    S. A. Karkanis, D. K. Iakovidis, D. E. Maroulis, D. A. Karras, and M. Tzivras. “Computer-aided tumor detection in endoscopic video using color wavelet features,” IEEE transactions on information technology in biomedicine, 7(3):141152, 2003

  6. [6]

    Color and position versus texture features for endoscopic polyp detection,

    L. A. Alexandre, N. Nobre, and J. Casteleiro. “Color and position versus texture features for endoscopic polyp detection,” In BioMedical Engi- neering and Informatics, 2008. BMEI 2008. International Conference on, volume 2, pages 3842. IEEE, 2008

  7. [7]

    Texture- based polyp detection in colonoscopy,

    S. Ameling, S.Wirth, D. Paulus, G. Lacey, and F. Vilarino. “Texture- based polyp detection in colonoscopy,” In Bildverarbeitung fur die Medizin 2009, pages 346350. Springer, 2009

  8. [8]

    A colon video analysis framework for polyp detection,

    S. Park, D. Sargent, I. Spofford, K. G. V osburgh, and Y . A-Rahim. “A colon video analysis framework for polyp detection,” IEEE Transactions on Biomedical Engineering, 59(5):1408, 2012

  9. [9]

    Polyp detection in colonoscopy video using elliptical shape feature,

    S. Hwang, J. Oh, W. Tavanapong, J. Wong, and P. C. De Groen. “Polyp detection in colonoscopy video using elliptical shape feature,” In Image Processing, 2007. ICIP 2007. IEEE International Conference on, volume 2, pages II465. IEEE, 2007

  10. [10]

    Towards automatic polyp detection with a polyp appearance model,

    J. Bernal, J. Sanchez, and F. Vilarino. “Towards automatic polyp detection with a polyp appearance model,” Pattern Recognition, 45(9):31663182, 2012

  11. [11]

    Automated polyp detection in colonoscopy videos using shape and context information,

    N. Tajbakhsh, S. R. Gurudu, and J. Liang. “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE transactions on medical imaging, 35(2):630644, 2016

  12. [12]

    com- parative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge,

    J. Bernal, N. Tajkbaksh, F. J. Sanchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad, I. Balasingham, et al. “com- parative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge,” IEEE trans- actions on medical imaging, 36(6):12311249, 2017

  13. [13]

    Brandao, E

    P. Brandao, E. Mazomenos, G. Ciuti, R. Cali, F. Bianchi, A. Menci- assi, P. Dario, A. Koulaouzidis, A. Arezzo, and D. Stoyanov. ”Fully convolutional neural networks for polyp segmentation in colonoscopy.” In Medical Imaging 2017: Computer-Aided Diagnosis, vol. 10134, p. 101340F. International Society for Optics and Photonics, 2017

  14. [14]

    Q. Li, G. Yang, Z. Chen, B. Huang, L. Chen, D. Xu, X. Zhou, S. Zhong, H. Zhang, and T. Wang. ”Colorectal polyp segmentation using a fully convolutional neural network.” In Image and Signal Processing, BioMed- ical Engineering and Informatics (CISP-BMEI), 2017 10th International Congress on, pp. 1-5. IEEE, 2017

  15. [15]

    Zhang, S

    L. Zhang, S. Dolwani, and X. Ye. ”Automated polyp segmentation in colonoscopy frames using fully convolutional neural network and textons.” In Annual Conference on Medical Image Understanding and Analysis, pp. 707-717. Springer, Cham, 2017

  16. [16]

    Y . Shin, H. A. Qadir, L. Aabakken, J. Bergsland, and I. Balasingham. Automatic colon polyp detection using region based deep cnn and post learning approaches. IEEE Access, 6:4095040962, 2018

  17. [17]

    Zhang, Y

    R. Zhang, Y . Zheng, C. CY Poon, D. Shen, and J. YW Lau. Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker

  18. [18]

    L. Yu, H. Chen, Q. Dou, J. Qin, and P. Ann Heng. Integrating online and offline three-dimensional deep learning for automated polyp detection in colonoscopy videos. IEEE journal of biomedical and health informatics, 21(1):6575, 2017

  19. [19]

    ”Abnormal Colon Polyp Image Synthesis Using Conditional Adversarial Networks for Improved Detection Performance.” IEEE Access 6 (2018): 56007- 56017

    Shin, Younghak, Hemin Ali Qadir, and Ilangko Balasingham. ”Abnormal Colon Polyp Image Synthesis Using Conditional Adversarial Networks for Improved Detection Performance.” IEEE Access 6 (2018): 56007- 56017

  20. [20]

    K. He, G. Gkioxari, P. Dollr, and R. Girshick. ”Mask r-cnn.” In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980-2988. IEEE, 2017

  21. [21]

    K. He, X. Zhang, S. Ren, and J. Sun. ”Deep residual learning for image recognition.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016

  22. [22]

    Szegedy, S

    C. Szegedy, S. Ioffe, V . Vanhoucke, and A. A. Alemi. ”Inception-v4, inception-resnet and the impact of residual connections on learning.” In AAAI, vol. 4, p. 12. 2017

  23. [23]

    Bernal, F

    J. Bernal, F. J. Snchez, G. Fernndez-Esparrach, D. Gil, C. Rodrguez, and F. Vilario (2015). ”WM-DOV A maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computer- ized Medical Imaging and Graphics, 43, 99-111

  24. [24]

    J. S. Silva, A. Histace, O. Romain, X. Dray, B. Granado, ”Towards embedded detection of polyps in WCE images for early diagnosis of colorectal cancer,” International Journal of Computer Assisted Radiology and Surgery, Springer Verlag (Germany), 2014, 9 (2), pp. 283-293

  25. [25]

    Bernal, F

    J. Bernal, F. J.r Sanchez, and F. Vilario. (2012). Towards Automatic Polyp Detection with a Polyp Appearance Model, Pattern Recognition, 45(9), 31663182

  26. [26]

    S. Ren, K. He, R. Girshick, and J. Sun. ”Faster r-cnn: Towards real-time object detection with region proposal networks.” In Advances in neural information processing systems, pp. 91-99. 2015

  27. [27]

    ”Going deeper with convolutions.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp

    Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. ”Going deeper with convolutions.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1-9. 2015

  28. [28]

    T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ar,and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740755. Springer,2014

  29. [29]

    Huang, V

    J. Huang, V . Rathod, C. Sun, M.g Zhu, A. Korattikara, A. Fathi, I. Fischer et al. ”Speed/accuracy trade-offs for modern convolutional object detectors.” In IEEE CVPR, vol. 4. 2017