Where are the Masks: Instance Segmentation with Image-level Supervision
Pith reviewed 2026-05-25 10:58 UTC · model grok-4.3
The pith
A two-stage pipeline generates pseudo masks from image-level labels to train Mask R-CNN for instance segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a simple two-stage framework—first training a classifier to generate pseudo masks for objects from image-level labels, then training a fully supervised Mask R-CNN on those pseudo masks—achieves new state-of-the-art mean average precision for instance segmentation under image-level supervision on PASCAL VOC 2012.
What carries the argument
The two-stage pipeline that converts outputs from an image-level classifier into pseudo masks usable as training data for Mask R-CNN.
If this is right
- Instance segmentation training becomes possible using only image tags that can be gathered with minimal effort such as web searches.
- The method delivers major gains in mean average precision over prior approaches that use the same level of supervision.
- The pipeline remains simple to implement and works with different underlying segmentation architectures.
- New state-of-the-art results are obtained for image-level supervised instance segmentation on PASCAL VOC 2012.
Where Pith is reading between the lines
- The same pseudo-mask generation step could be tested on larger or more diverse image collections beyond the VOC benchmark.
- Iterating the classifier and segmenter stages might further refine the pseudo masks and raise final accuracy.
- The approach opens a route to training on web-scale tagged photos where pixel labels are unavailable.
- Cost savings in annotation could make instance segmentation practical for domains that currently lack detailed datasets.
Load-bearing premise
The pseudo masks produced by the image-level classifier must be accurate enough to serve as effective training data for the Mask R-CNN without introducing too much noise or bias.
What would settle it
Running the full pipeline on PASCAL VOC 2012 and obtaining mAP no higher than existing weakly supervised methods would falsify the performance claim.
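To make the comparison concrete, the following is a minimal sketch of how average precision for one class could be computed from ranked detections. It assumes that each detection has already been matched against ground truth at some IoU threshold and flagged as a true or false positive; that matching step, the function names, and the toy numbers are illustrative assumptions, not taken from the paper. The interpolation used here (area under the non-increasing precision envelope) is one common definition of AP and may differ from the exact protocol the authors follow.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per detection.
    num_gt: number of ground-truth instances of this class.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0


# Toy example (made-up detections): two ground-truth objects, three detections.
toy_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

The performance claim would be tested by comparing the resulting mAP of the full pipeline against the mAP reported for earlier weakly supervised methods under the same evaluation protocol.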
Original abstract
A major obstacle in instance segmentation is that existing methods often need many per-pixel labels in order to be effective. These labels require large human effort and, for certain applications, are not readily available. To address this limitation, we propose a novel framework that can effectively train with image-level labels, which are significantly cheaper to acquire. For instance, one can do an internet search for the term "car" and obtain many images where a car is present with minimal effort. Our framework consists of two stages: (1) train a classifier to generate pseudo masks for the objects of interest; (2) train a fully supervised Mask R-CNN on these pseudo masks. Our two main contributions are proposing a pipeline that is simple to implement and amenable to different segmentation methods, and achieving new state-of-the-art results for this problem setup. Our results are based on evaluating our method on PASCAL VOC 2012, a standard dataset for weakly supervised methods, where we demonstrate major performance gains compared to existing methods with respect to mean average precision.
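The abstract does not say how stage (1) turns classifier output into masks. As a rough illustration only, the sketch below assumes the classifier yields a per-class activation map, thresholds it relative to its peak, and treats each connected region as one pseudo instance mask. The thresholding rule, the 0.5 cutoff, the 4-connectivity, and the function name are all assumptions made for this example; they stand in for whatever procedure the paper actually uses.

```python
from collections import deque


def pseudo_masks_from_activation(act, rel_threshold=0.5):
    """Split one class's activation map into pseudo instance masks.

    act: 2D list of non-negative floats (hypothetical classifier output).
    Returns a list of masks, each a set of (row, col) pixel coordinates.
    NOTE: threshold + connected components is an assumed stand-in,
    not the procedure described in the paper.
    """
    height, width = len(act), len(act[0])
    peak = max(max(row) for row in act)
    if peak <= 0:
        return []
    cutoff = rel_threshold * peak
    foreground = [[act[r][c] >= cutoff for c in range(width)] for r in range(height)]

    seen = set()
    masks = []
    for r in range(height):
        for c in range(width):
            if not foreground[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            component = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and foreground[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            masks.append(component)
    return masks


# Toy activation map with two separated high-response regions.
toy_activation = [
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.7, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.8],
]
toy_masks = pseudo_masks_from_activation(toy_activation)
```

In stage (2), masks of this kind, paired with the image-level class label, would take the place of human-drawn annotations when training Mask R-CNN.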
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage framework for instance segmentation under image-level supervision only. Stage (1) trains a classifier to produce pseudo masks; stage (2) trains a standard Mask R-CNN on those pseudo masks. The central claim is that this simple pipeline is easy to implement, is amenable to different segmentation methods, and delivers new state-of-the-art mAP on PASCAL VOC 2012 with major gains over prior weakly-supervised methods.
Significance. If the pseudo-mask quality is adequate, the approach would offer a practical route to instance segmentation with far cheaper labels than pixel-level annotation. The emphasis on simplicity and compatibility with existing fully-supervised detectors is a genuine strength. However, the manuscript supplies neither quantitative results nor any direct measurement of pseudo-mask fidelity, so the significance cannot yet be assessed.
major comments (3)
- [Abstract] Abstract: the claims of 'new state-of-the-art results' and 'major performance gains' with respect to mAP are stated without any numerical values, tables, or comparisons to prior methods, preventing verification of the central performance claim.
- [Framework description] Framework description (stage 1): no details are given on the classifier architecture, the pseudo-mask generation procedure, or any direct evaluation of mask quality (e.g., per-instance IoU against ground truth). This quantity is load-bearing for the claim that the generated masks can serve as training data for Mask R-CNN without prohibitive noise.
- [Evaluation] Evaluation: the manuscript contains no ablation studies that isolate the effect of pseudo-mask noise on the second-stage detector or that quantify how much the reported mAP depends on the quality of the stage-1 masks.
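The mask-fidelity measurement requested in the second comment could look roughly like the sketch below. It represents masks as sets of pixel coordinates and, for each ground-truth instance, scores the best-overlapping pseudo mask. The function names, the best-match pairing rule, and the toy masks are assumptions for illustration; the manuscript does not specify a matching protocol.

```python
def mask_iou(pred, truth):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    if not union:
        return 0.0  # convention chosen here for two empty masks
    return len(pred & truth) / len(union)


def mean_best_iou(pseudo_masks, gt_masks):
    """Average, over ground-truth instances, of the best IoU with any pseudo mask."""
    if not gt_masks:
        return 0.0
    best = [
        max((mask_iou(p, g) for p in pseudo_masks), default=0.0)
        for g in gt_masks
    ]
    return sum(best) / len(best)


# Toy example: two ground-truth instances, each partially recovered.
gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(2, 3), (2, 4)}]
pseudo = [{(0, 0), (0, 1), (1, 0)}, {(2, 4)}]
score = mean_best_iou(pseudo, gt)  # (3/4 + 1/2) / 2
```

A ground-truth instance with no overlapping pseudo mask contributes 0 to the average, so missed objects lower the score as well as poorly shaped masks.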
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that improve clarity and completeness without altering the core claims or experimental setup.
- Referee: [Abstract] Abstract: the claims of 'new state-of-the-art results' and 'major performance gains' with respect to mAP are stated without any numerical values, tables, or comparisons to prior methods, preventing verification of the central performance claim.
  Authors: We agree the abstract would be stronger with explicit numbers. The body of the manuscript contains tables reporting mAP on PASCAL VOC 2012 with direct comparisons to prior weakly-supervised methods. In revision we will insert the key numerical results (our mAP and the margin over the previous best) into the abstract while respecting length limits.
  Revision: yes
- Referee: [Framework description] Framework description (stage 1): no details are given on the classifier architecture, the pseudo-mask generation procedure, or any direct evaluation of mask quality (e.g., per-instance IoU against ground truth). This quantity is load-bearing for the claim that the generated masks can serve as training data for Mask R-CNN without prohibitive noise.
  Authors: We will expand Section 3 (stage 1) with the precise classifier architecture, the full pseudo-mask generation algorithm, and a quantitative assessment of mask fidelity via per-instance IoU on a held-out set. These additions will directly support the claim that the pseudo masks are usable for the second stage.
  Revision: yes
- Referee: [Evaluation] Evaluation: the manuscript contains no ablation studies that isolate the effect of pseudo-mask noise on the second-stage detector or that quantify how much the reported mAP depends on the quality of the stage-1 masks.
  Authors: We will add a new ablation subsection that varies pseudo-mask quality (via controlled noise injection or threshold sweeps) and reports the resulting change in final Mask R-CNN mAP. This will quantify sensitivity to stage-1 mask fidelity.
  Revision: yes
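The controlled noise injection mentioned in the last response could be set up along the lines of the sketch below. It only covers the corruption step: each pixel of a mask is toggled with a fixed probability, and the IoU with the clean mask serves as a quick check of how much fidelity was lost. The pixel-flip noise model, the function names, and the chosen probabilities are assumptions for illustration; a real ablation would retrain Mask R-CNN on the corrupted masks at each noise level and report mAP, which is not shown here.

```python
import random


def flip_pixels(mask, height, width, flip_prob, seed=0):
    """Return a noisy copy of a binary mask; each pixel toggles with flip_prob.

    mask: set of (row, col) foreground pixels inside a height x width image.
    The pixel-flip noise model is an assumption made for this sketch.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    noisy = set()
    for r in range(height):
        for c in range(width):
            inside = (r, c) in mask
            if rng.random() < flip_prob:
                inside = not inside
            if inside:
                noisy.add((r, c))
    return noisy


def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# A 4x4 square object inside an 8x8 image, corrupted at three noise levels.
clean = {(r, c) for r in range(2, 6) for c in range(2, 6)}
sweep = {p: iou(flip_pixels(clean, 8, 8, p), clean) for p in (0.0, 0.1, 0.3)}
```

Plotting final mAP against the noise level (or against the measured IoU of the corrupted masks) would show how sensitive stage 2 is to stage-1 mask quality.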
Circularity Check
No circularity; standard empirical two-stage pipeline on external benchmark
The manuscript presents an empirical pipeline (image-level classifier produces pseudo-masks; Mask R-CNN trained on those masks) and reports mAP on PASCAL VOC 2012. No equations, fitted parameters renamed as predictions, self-citations used as uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation. The central claim rests on end-to-end experimental results rather than any self-referential reduction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Our framework consists of two stages: (1) train a classifier to generate pseudo masks... (2) train a fully supervised Mask R-CNN on these pseudo masks."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "achieves new state-of-the-art results... major performance gains compared to existing methods with respect to mean average precision"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.