Where are the Masks: Instance Segmentation with Image-level Supervision

David Vazquez; Issam H. Laradji; Mark Schmidt

arxiv: 1907.01430 · v1 · pith:B7AVZCJCnew · submitted 2019-07-02 · 💻 cs.CV · cs.LG · eess.IV

Issam H. Laradji, David Vazquez, Mark Schmidt

Pith reviewed 2026-05-25 10:58 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · eess.IV

keywords instance segmentation · weakly supervised · image-level labels · pseudo masks · Mask R-CNN · PASCAL VOC · object segmentation

The pith

A two-stage pipeline generates pseudo masks from image-level labels to train Mask R-CNN for instance segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that instance segmentation models, which normally demand costly per-pixel annotations, can instead be trained from far cheaper image-level labels alone. It proposes first training a classifier on those labels to produce approximate object masks, then feeding the masks as supervision into a Mask R-CNN. A reader would care because this approach could scale training to large image collections whose only labels are the search terms used to retrieve them, cutting the human effort required while still delivering usable accuracy. The work evaluates the idea on the standard PASCAL VOC 2012 benchmark and reports gains over earlier weakly supervised methods.

Core claim

The central claim is that a simple two-stage framework—first training a classifier to generate pseudo masks for objects from image-level labels, then training a fully supervised Mask R-CNN on those pseudo masks—achieves new state-of-the-art mean average precision for instance segmentation under image-level supervision on PASCAL VOC 2012.

What carries the argument

The two-stage pipeline that converts outputs from an image-level classifier into pseudo masks usable as training data for Mask R-CNN.

If this is right

Instance segmentation training becomes possible using only image tags that can be gathered with minimal effort such as web searches.
The method delivers major gains in mean average precision compared with prior approaches at the same supervision level.
The pipeline remains simple to implement and works with different underlying segmentation architectures.
New state-of-the-art results are obtained for image-level supervised instance segmentation on PASCAL VOC 2012.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pseudo-mask generation step could be tested on larger or more diverse image collections beyond the VOC benchmark.
Iterating the classifier and segmenter stages might further refine the pseudo masks and raise final accuracy.
The approach opens a route to training on web-scale tagged photos where pixel labels are unavailable.
Cost savings in annotation could make instance segmentation practical for domains that currently lack detailed datasets.

Load-bearing premise

The pseudo masks produced by the image-level classifier must be accurate enough to serve as effective training data for the Mask R-CNN without introducing too much noise or bias.

What would settle it

Running the full pipeline on PASCAL VOC 2012 and obtaining mAP no higher than existing weakly supervised methods would falsify the performance claim.
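The comparison above rests on average precision at a fixed IoU threshold (the paper text mentions mAP25, mAP50, and mAP75). The following is a minimal sketch of per-class AP under stated assumptions: each detection is assumed to arrive already paired with its best-overlapping ground-truth instance as a hypothetical `(score, gt_index, iou)` tuple, and the area under the precision-recall curve uses all-point interpolation. It is not taken from the paper's evaluation code; mAP would be the mean of this value over classes.

```python
from typing import List, Optional, Tuple

# Hypothetical detection record: (confidence, index of best-overlapping
# ground-truth instance or None, IoU with that instance).
Detection = Tuple[float, Optional[int], float]


def average_precision(detections: List[Detection], num_gt: int,
                      iou_threshold: float = 0.5) -> float:
    """Per-class AP with all-point interpolation (an assumed protocol)."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = set()
    hits = []
    for _score, gt_index, iou in ranked:
        # A detection counts once per ground-truth instance.
        hit = (gt_index is not None and iou >= iou_threshold
               and gt_index not in matched)
        if hit:
            matched.add(gt_index)
        hits.append(hit)

    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt if num_gt else 0.0)

    # Make precision non-increasing when read from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy example: three detections, two ground-truth instances.
toy_detections = [(0.9, 0, 0.8), (0.8, None, 0.0), (0.7, 1, 0.6)]
ap50 = average_precision(toy_detections, num_gt=2, iou_threshold=0.5)
ap75 = average_precision(toy_detections, num_gt=2, iou_threshold=0.75)
```

Raising the threshold from 0.5 to 0.75 turns the third toy detection into a miss, which is why stricter thresholds are more sensitive to mask quality.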

Figures

Figures reproduced from arXiv: 1907.01430 by David Vazquez, Issam H. Laradji, Mark Schmidt.

**Figure 1.** Framework overview. Our Weakly-supervised Instance SEgmentation (WISE) method learns to perform instance segmentation with image-level supervision. First, a classifier is trained with a peak stimulation layer to identify peaks at which the objects are located (row 2). A proposal gallery (such as MCG [2]) is used to obtain rough masks for the located objects, which are then used as pseudo masks to train Ma…

**Figure 2.** WISE training. The first component (shown in blue) learns to classify the images in the dataset. The classifier first outputs a class activation map (CAM), then obtains the CAM's local maxima using a peak stimulation layer (PSL). To train the classifier, the classification loss is computed using the average of these local maxima. As the CAM peaks represent located objects, we select a proposal for each of t…
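The peak step in the Figure 2 caption can be sketched as follows. This is an illustrative reading, not the authors' layer: the strict-maximum rule, the `radius` window, and the fallback to the global mean are assumptions, and the names `local_maxima` and `peak_average` are hypothetical. The excerpt does not state which classification loss is applied to the averaged value, so the sketch stops at the score.

```python
from typing import List, Tuple

Grid = List[List[float]]


def local_maxima(cam: Grid, radius: int = 1) -> List[Tuple[int, int]]:
    """Return (row, col) cells strictly greater than every neighbour.

    Assumption: a peak is a strict maximum inside a square window of
    side 2 * radius + 1, so flat plateaus produce no peak.
    """
    rows, cols = len(cam), len(cam[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                cam[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
                if (i, j) != (r, c)
            ]
            if neighbours and all(cam[r][c] > v for v in neighbours):
                peaks.append((r, c))
    return peaks


def peak_average(cam: Grid, radius: int = 1) -> float:
    """Class score: mean CAM value over the detected peaks.

    Assumption: when no peak exists, fall back to the global mean.
    """
    peaks = local_maxima(cam, radius)
    if peaks:
        return sum(cam[r][c] for r, c in peaks) / len(peaks)
    cells = [v for row in cam for v in row]
    return sum(cells) / len(cells)


# Toy CAM for one class with two bright spots.
toy_cam = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.1, 0.3],
    [0.0, 0.1, 0.3, 0.7],
]
toy_peaks = local_maxima(toy_cam)
toy_score = peak_average(toy_cam)
```

In this toy map the two bright cells are the only peaks, so the class score is driven by the located objects rather than by the background.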

**Figure 3.** Inference. At test time, only the trained Mask R-CNN is required to output the prediction masks in the image. As an optional refinement step, the predicted masks can be replaced with the object proposals of highest Jaccard similarity (shown as green components in …
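The optional refinement step in the Figure 3 caption amounts to a nearest-proposal lookup under Jaccard similarity. Below is a minimal sketch, assuming masks are represented as sets of `(row, col)` pixels; the helper names `box`, `jaccard`, and `refine` are hypothetical, and returning the prediction unchanged when no proposals exist is an assumption.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def box(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Rectangular toy mask covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def jaccard(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def refine(predicted: Mask, proposals: List[Mask]) -> Mask:
    """Replace a predicted mask with the most similar proposal.

    Assumption: with an empty proposal gallery the prediction is kept.
    """
    if not proposals:
        return predicted
    return max(proposals, key=lambda proposal: jaccard(predicted, proposal))


# Toy example: a 3x3 prediction and three candidate proposals.
prediction = box(0, 0, 3, 3)
gallery = [box(0, 0, 2, 2), box(0, 0, 3, 4), box(5, 5, 7, 7)]
refined = refine(prediction, gallery)
```

The design trade-off is that refinement can only be as good as the proposal gallery: it snaps a prediction to proposal boundaries, which helps when proposals follow object edges and hurts when no proposal fits.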

**Figure 4.** Qualitative results of WISE on the PASCAL VOC 2012 val set. The images illustrate the predicted masks of the trained Mask R-CNN for different classes.

… margin with respect to Average Best Overlap (ABO) [33], mAP25, mAP50, and mAP75. Further, WISE without refinement also beats the current state of the art. Even more so, our method outperforms Cholakkal et al. [8], which uses slightly stronger la…

**Figure 5.** Statistical analysis. The left figure illustrates the performance of WISE and a Mask R-CNN trained on per-pixel labels across various object sizes; the right figure illustrates the same benchmark across images with different numbers of objects.

… the number of identified objects in the images. These results are summarized in …

read the original abstract

A major obstacle in instance segmentation is that existing methods often need many per-pixel labels in order to be effective. These labels require large human effort and, for certain applications, such labels are not readily available. To address this limitation, we propose a novel framework that can effectively train with image-level labels, which are significantly cheaper to acquire. For instance, one can do an internet search for the term "car" and obtain many images where a car is present with minimal effort. Our framework consists of two stages: (1) train a classifier to generate pseudo masks for the objects of interest; (2) train a fully supervised Mask R-CNN on these pseudo masks. Our two main contributions are proposing a pipeline that is simple to implement and amenable to different segmentation methods, and achieving new state-of-the-art results for this problem setup. Our results are based on evaluating our method on PASCAL VOC 2012, a standard dataset for weakly supervised methods, where we demonstrate major performance gains compared to existing methods with respect to mean average precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Straightforward two-stage pipeline using image-level labels to train Mask R-CNN via pseudo masks, but the lack of direct checks on mask quality leaves the performance claims hard to evaluate.

The paper's core idea is a two-stage setup: train a classifier on image-level labels to produce pseudo masks, then feed those into a standard Mask R-CNN. This is presented as simple to implement and compatible with other segmentation backbones, with a claim of new state-of-the-art mAP on PASCAL VOC 2012 under image-level supervision. That direction makes sense given how much cheaper image tags are than pixel labels, and the pipeline itself does not introduce exotic components. Credit to the authors for keeping it practical rather than over-engineered.

The main weakness is that everything hinges on the pseudo masks being good enough training data. The description gives end-to-end mAP numbers but no per-instance IoU measurements against ground truth for the masks themselves, and no ablations that swap in cleaner or noisier masks to show the effect. Without those, it is difficult to tell whether the reported gains come from the method or from lucky pseudo-mask quality on this particular dataset.

The citation pattern looks standard for the weakly supervised segmentation literature. This is the kind of paper that would interest people already working on reducing annotation costs in detection and segmentation. A reader who wants a concrete, runnable baseline for image-level instance segmentation could get value from the implementation details. It is coherent enough on its own terms to deserve a serious referee, even if the review would likely ask for the missing mask-quality diagnostics and more controlled experiments.

Referee Report

3 major / 0 minor

Summary. The paper proposes a two-stage framework for instance segmentation under image-level supervision only. Stage (1) trains a classifier to produce pseudo masks; stage (2) trains a standard Mask R-CNN on those pseudo masks. The central claim is that this simple pipeline is easy to implement, works with different segmentation backbones, and delivers new state-of-the-art mAP on PASCAL VOC 2012 with major gains over prior weakly-supervised methods.

Significance. If the pseudo-mask quality is adequate, the approach would offer a practical route to instance segmentation with far cheaper labels than pixel-level annotation. The emphasis on simplicity and compatibility with existing fully-supervised detectors is a genuine strength. However, the manuscript supplies neither quantitative results nor any direct measurement of pseudo-mask fidelity, so the significance cannot yet be assessed.

major comments (3)

[Abstract] Abstract: the claims of 'new state-of-the-art results' and 'major performance gains' with respect to mAP are stated without any numerical values, tables, or comparisons to prior methods, preventing verification of the central performance claim.
[Framework description] Framework description (stage 1): no details are given on the classifier architecture, the pseudo-mask generation procedure, or any direct evaluation of mask quality (e.g., per-instance IoU against ground truth). This quantity is load-bearing for the claim that the generated masks can serve as training data for Mask R-CNN without prohibitive noise.
[Evaluation] Evaluation: the manuscript contains no ablation studies that isolate the effect of pseudo-mask noise on the second-stage detector or that quantify how much the reported mAP depends on the quality of the stage-1 masks.
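The mask-quality diagnostic requested in these comments could be computed along the following lines. This is a minimal sketch, assuming binary masks stored as equally sized nested lists of 0/1 and ground truth available for a held-out set; the function names are hypothetical, and the averaging follows the usual reading of Average Best Overlap rather than any code from the paper.

```python
from typing import List

Mask = List[List[int]]  # binary grid, 1 = object pixel


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    intersection = union = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            intersection += 1 if (x and y) else 0
            union += 1 if (x or y) else 0
    return intersection / union if union else 0.0


def per_instance_best_iou(gt_masks: List[Mask],
                          pseudo_masks: List[Mask]) -> List[float]:
    """For each ground-truth instance, the best IoU over all pseudo masks."""
    return [
        max((mask_iou(gt, pseudo) for pseudo in pseudo_masks), default=0.0)
        for gt in gt_masks
    ]


def average_best_overlap(gt_masks: List[Mask],
                         pseudo_masks: List[Mask]) -> float:
    """Mean of the per-instance best IoU values (0.0 with no ground truth)."""
    best = per_instance_best_iou(gt_masks, pseudo_masks)
    return sum(best) / len(best) if best else 0.0


# Toy example on 2x2 images: two ground-truth instances, two pseudo masks.
gt = [[[1, 1], [0, 0]], [[0, 0], [0, 1]]]
pseudo = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
best_per_instance = per_instance_best_iou(gt, pseudo)
abo = average_best_overlap(gt, pseudo)
```

Reporting the per-instance list alongside the mean would show whether low fidelity is spread evenly or concentrated in a few badly localized objects.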

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that improve clarity and completeness without altering the core claims or experimental setup.

Referee: [Abstract] Abstract: the claims of 'new state-of-the-art results' and 'major performance gains' with respect to mAP are stated without any numerical values, tables, or comparisons to prior methods, preventing verification of the central performance claim.

Authors: We agree the abstract would be stronger with explicit numbers. The body of the manuscript contains tables reporting mAP on PASCAL VOC 2012 with direct comparisons to prior weakly-supervised methods. In revision we will insert the key numerical results (our mAP and the margin over the previous best) into the abstract while respecting length limits.
revision: yes

Referee: [Framework description] Framework description (stage 1): no details are given on the classifier architecture, the pseudo-mask generation procedure, or any direct evaluation of mask quality (e.g., per-instance IoU against ground truth). This quantity is load-bearing for the claim that the generated masks can serve as training data for Mask R-CNN without prohibitive noise.

Authors: We will expand Section 3 (stage 1) with the precise classifier architecture, the full pseudo-mask generation algorithm, and a quantitative assessment of mask fidelity via per-instance IoU on a held-out set. These additions will directly support the claim that the pseudo masks are usable for the second stage.
revision: yes

Referee: [Evaluation] Evaluation: the manuscript contains no ablation studies that isolate the effect of pseudo-mask noise on the second-stage detector or that quantify how much the reported mAP depends on the quality of the stage-1 masks.

Authors: We will add a new ablation subsection that varies pseudo-mask quality (via controlled noise injection or threshold sweeps) and reports the resulting change in final Mask R-CNN mAP. This will quantify sensitivity to stage-1 mask fidelity.
revision: yes
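The controlled noise injection mentioned in this response could start from something like the sketch below. It covers only the corruption step: each pixel of a binary mask is flipped independently with probability `p`, which is one assumed noise model among many (boundary erosion or shifts would be others). The names `flip_noise` and `disagreement` are hypothetical, and retraining Mask R-CNN on the corrupted masks to read off mAP is outside the sketch.

```python
import random
from typing import List

Mask = List[List[int]]  # binary grid, 1 = object pixel


def flip_noise(mask: Mask, p: float, seed: int = 0) -> Mask:
    """Flip each pixel independently with probability p (assumed noise model)."""
    rng = random.Random(seed)  # seeded so a sweep is reproducible
    return [[1 - v if rng.random() < p else v for v in row] for row in mask]


def disagreement(a: Mask, b: Mask) -> float:
    """Fraction of pixels on which two equally sized masks differ."""
    total = sum(len(row) for row in a)
    differing = sum(
        x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    )
    return differing / total if total else 0.0


# Toy sweep: corrupt one clean mask at increasing noise levels.
clean = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
sweep = {p: disagreement(clean, flip_noise(clean, p)) for p in (0.0, 0.1, 0.3, 1.0)}
```

Each noise level in such a sweep would then be paired with the mAP of a model trained on the corrupted masks, giving the sensitivity curve the referee asks for.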

Circularity Check

0 steps flagged

No circularity; standard empirical two-stage pipeline on external benchmark

The manuscript presents an empirical pipeline (image-level classifier produces pseudo-masks; Mask R-CNN trained on those masks) and reports mAP on PASCAL VOC 2012. No equations, fitted parameters renamed as predictions, self-citations used as uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation. The central claim rests on end-to-end experimental results rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any fitted parameters, axioms, or new entities; the method builds on existing Mask R-CNN and classifier training with standard deep learning assumptions.

pith-pipeline@v0.9.0 · 5719 in / 1211 out tokens · 35395 ms · 2026-05-25T10:58:40.590828+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Our framework consists of two stages: (1) train a classifier to generate pseudo masks... (2) train a fully supervised Mask R-CNN on these pseudo masks.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

achieves new state-of-the-art results... major performance gains compared to existing methods with respect to mean average precision

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1]

Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation

Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, 2018

work page 2018
[2]

Multiscale combinatorial grouping

Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014

work page 2014
[3]

What's the point: Semantic segmentation with point supervision

Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What's the point: Semantic segmentation with point supervision. In ECCV, 2016

work page 2016
[4]

Weakly supervised deep detection networks

Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016

work page 2016
[5]

Yolact: Real-time instance segmentation

Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. arXiv preprint arXiv:1904.02689, 2019

work page arXiv 1904
[6]

Masklab: Instance segmentation by refining object detection with semantic and direction features

Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, and Hartwig Adam. Masklab: Instance segmentation by refining object detection with semantic and direction features. In CVPR, 2018

work page 2018
[7]

Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 40(4):834–848, 2018

work page 2018
[8]

Object counting and instance segmentation with image-level supervision

Hisham Cholakkal, Guolei Sun, Fahad Shahbaz Khan, and Ling Shao. Object counting and instance segmentation with image-level supervision. In CVPR, 2019

work page 2019
[9]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016

work page 2016
[10]

Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation

Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. CoRR, 2015

work page 2015
[11]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

work page 2009
[12]

The pascal visual object classes (voc) challenge

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010

work page 2010
[13]

Cheng-Yang Fu, Mykhailo Shvets, and Alexander C. Berg. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016

work page 2016
[15]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017

work page 2017
[16]

What makes for effective detection proposals?

Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for effective detection proposals? T-PAMI, 2016

work page 2016
[17]

Simple does it: Weakly supervised instance and semantic segmentation

Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017

work page 2017
[18]

Seed, expand and constrain: Three principles for weakly-supervised image segmentation

Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016

work page 2016
[19]

Instance segmentation of fibers from low resolution CT scans via 3D deep embedding learning

Tomasz K. Konopczynski, Thorben Kröger, Lei Zheng, and Jürgen Hesser. Instance segmentation of fibers from low resolution CT scans via 3D deep embedding learning. In BMVC, 2018

work page 2018
[20]

Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification

Maksim Lapin, Matthias Hein, and Bernt Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. PAMI, 40(7):1533–1554, 2018

work page 2018
[21]

Where are the blobs: Counting by localization with point supervision

Issam H Laradji, Negar Rostamzadeh, Pedro O Pinheiro, David Vazquez, and Mark Schmidt. Where are the blobs: Counting by localization with point supervision. In ECCV, 2018

work page 2018
[22]

Instance Segmentation with Point Supervision

Issam H. Laradji, Negar Rostamzadeh, Pedro O. Pinheiro, David Vázquez, and Mark W. Schmidt. Instance segmentation with point supervision. ArXiv, abs/1906.06392, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[23]

Scribblesup: Scribble-supervised convolutional networks for semantic segmentation

Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016

work page 2016
[24]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

work page 2014
[25]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2016

work page 2016
[26]

Fully convolutional networks for semantic segmentation

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

work page 2015
[27]

Convolutional oriented boundaries

Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Pablo Arbeláez, and Luc Van Gool. Convolutional oriented boundaries. In ECCV, 2016

work page 2016
[28]

maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch

Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018. Accessed: March 12th 2019

work page 2018
[29]

From image-level to pixel-level labeling with convolutional networks

Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015

work page 2015
[30]

Learning to segment object candidates

Pedro O Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In NIPS, 2015

work page 2015
[31]

Learning to refine object segments

Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In ECCV, 2016

work page 2016
[32]

Segmentation of medical images using adaptive region growing

Regina Pohle and Klaus D Toennies. Segmentation of medical images using adaptive region growing. In MIIP, 2001

work page 2001
[33]

Boosting object proposals: From pascal to coco

Jordi Pont-Tuset and Luc Van Gool. Boosting object proposals: From pascal to coco. In CVPR, 2015

work page 2015
[34]

Augmented feedback in semantic segmentation under image level supervision

Xiaojuan Qi, Zhengzhe Liu, Jianping Shi, Hengshuang Zhao, and Jiaya Jia. Augmented feedback in semantic segmentation under image level supervision. In ECCV, 2016

work page 2016
[35]

End-to-end instance segmentation with recurrent attention

Mengye Ren and Richard S Zemel. End-to-end instance segmentation with recurrent attention. In CVPR, 2017

work page 2017
[36]

Faster r-cnn: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015

work page 2015
[37]

Recurrent instance segmentation

Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In ECCV, 2016

work page 2016
[38]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017

work page 2017
[39]

Multiple instance detection network with online instance classifier refinement

Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. Multiple instance detection network with online instance classifier refinement. In CVPR, 2017

work page 2017
[40]

Weakly supervised region proposal network and object detection

Peng Tang, Xinggang Wang, Angtian Wang, Yongluan Yan, Wenyu Liu, Junzhou Huang, and Alan Yuille. Weakly supervised region proposal network and object detection. In ECCV, 2018

work page 2018
[41]

Selective search for object recognition

Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. In ICCV, 2013

work page 2013
[42]

Learning deep features for discriminative localization

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016

work page 2016
[43]

Weakly supervised instance segmentation using class peak response

Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation using class peak response. In CVPR, 2018

work page 2018
[44]

Edge boxes: Locating object proposals from edges

C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014

work page 2014

[1] [1]

Learning pixel-level semantic afﬁnity with image-level supervision for weakly supervised semantic segmentation

Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic afﬁnity with image-level supervision for weakly supervised semantic segmentation. CVPR, 2018

work page 2018

[2] [2]

Multiscale combinatorial grouping

Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014

work page 2014

[3] [3]

Whatâ ˘A ´Zs the point: Semantic segmentation with point supervision

Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. Whatâ ˘A ´Zs the point: Semantic segmentation with point supervision. In ECCV, 2016

work page 2016

[4] [4]

Weakly supervised deep detection networks

Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016

work page 2016

[5] [5]

Yolact: Real-time instance segmentation

Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. arXiv preprint arXiv:1904.02689, 2019

work page arXiv 1904

[6] [6]

Masklab: Instance segmentation by reﬁning object detec- tion with semantic and direction features

Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, and Hartwig Adam. Masklab: Instance segmentation by reﬁning object detec- tion with semantic and direction features. In CVPR, 2018

work page 2018

[7] [7]

Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 40(4):834–848, 2018

work page 2018

[8] [8]

Object counting and instance segmentation with image-level supervision

Hisham Cholakkal, Guolei Sun, Fahad Shahbaz Khan, and Ling Shao. Object counting and instance segmentation with image-level supervision. In CVPR, 2019

work page 2019

[9] [9]

The cityscapes dataset for semantic urban scene understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016

work page 2016

[10] [10]

Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation

Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. CoRR, 2015

work page 2015

[11] [11]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

work page 2009

[12] [12]

The pascal visual object classes (voc) challenge

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010

work page 2010

[13] [13]

Cheng-Yang Fu, Mykhailo Shvets, and Alexander C. Berg. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. InarXiv preprint arXiv:1901.03353, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[14] [14]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. LARADJI, V AZQUEZ, & SCHMIDT: WHERE ARE THE MASKS 11

work page 2016

[15] [15]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017

work page 2017

[16] [16]

What makes for ef- fective detection proposals? T-PAMI, 2016

Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for ef- fective detection proposals? T-PAMI, 2016

work page 2016

[17] [17]

Simple does it: Weakly supervised instance and semantic segmentation

Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017

work page 2017

[18] [18]

Seed, expand and constrain: Three principles for weakly-supervised image segmentation

Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016

work page 2016

[19] [19]

Konopczynski, Thorben Kröger, Lei Zheng, and Jürgen Hesser

Tomasz K. Konopczynski, Thorben Kröger, Lei Zheng, and Jürgen Hesser. Instance segmentation of ﬁbers from low resolution ct scans via 3d deep embedding learning. In BMVC, 2018

work page 2018

[20] [20]

Analysis and optimization of loss functions for multiclass, top-k, and multilabel classiﬁcation

Maksim Lapin, Matthias Hein, and Bernt Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classiﬁcation. PAMI, 40(7):1533–1554, 2018

work page 2018

[21] [21]

Where are the blobs: Counting by localization with point supervision

Issam H Laradji, Negar Rostamzadeh, Pedro O Pinheiro, David Vazquez, and Mark Schmidt. Where are the blobs: Counting by localization with point supervision. In ECCV, 2018

work page 2018

[22] [22]

Instance Segmentation with Point Supervision

Issam H. Laradji, Negar Rostamzadeh, Pedro O. Pinheiro, David Vázquez, and Mark W. Schmidt. Instance segmentation with point supervision. ArXiv, abs/1906.06392, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906

[23] [23]

Scribblesup: Scribble- supervised convolutional networks for semantic segmentation

Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribble- supervised convolutional networks for semantic segmentation. In CVPR, 2016

work page 2016

[24] [24]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ra- manan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

work page 2014

[25] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2016

[26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

[27] Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Pablo Arbeláez, and Luc Van Gool. Convolutional oriented boundaries. In ECCV, 2016

[28] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018. Accessed: March 12th 2019

[29] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015

[30] Pedro O Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In NIPS, 2015

[31] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In ECCV, 2016

[32] Regina Pohle and Klaus D Toennies. Segmentation of medical images using adaptive region growing. In MIIP, 2001

[33] Jordi Pont-Tuset and Luc Van Gool. Boosting object proposals: From pascal to coco. In CVPR, 2015

[34] Xiaojuan Qi, Zhengzhe Liu, Jianping Shi, Hengshuang Zhao, and Jiaya Jia. Augmented feedback in semantic segmentation under image level supervision. In ECCV, 2016

[35] Mengye Ren and Richard S Zemel. End-to-end instance segmentation with recurrent attention. In CVPR, 2017

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015

[37] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In ECCV, 2016

[38] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017

[39] Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. Multiple instance detection network with online instance classifier refinement. In CVPR, 2017

[40] Peng Tang, Xinggang Wang, Angtian Wang, Yongluan Yan, Wenyu Liu, Junzhou Huang, and Alan Yuille. Weakly supervised region proposal network and object detection. In ECCV, 2018

[41] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. In ICCV, 2013

[42] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016

[43] Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation using class peak response. In CVPR, 2018

[44] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014