Cascade R-CNN: High Quality Object Detection and Instance Segmentation

Nuno Vasconcelos; Zhaowei Cai

arxiv: 1906.09756 · v1 · pith:EAQHZPR3new · submitted 2019-06-24 · 💻 cs.CV

Pith reviewed 2026-05-25 17:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords cascade r-cnn · object detection · instance segmentation · iou threshold · high-quality detection · coco dataset · mask r-cnn · multi-stage detector

The pith

Cascade R-CNN trains a sequence of detectors at rising IoU thresholds to overcome overfitting and quality mismatch in high-quality object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard object detectors suffer when IoU thresholds rise above the usual 0.5 because positive training samples vanish and test-time hypotheses do not match the training quality. The paper introduces a multi-stage cascade where each detector's outputs are resampled to train the next detector at a higher IoU threshold. This resampling keeps the number of positive examples roughly constant across stages while steadily refining hypothesis quality. The same cascade structure is reused at inference time to align detector strength with the quality of incoming boxes. The resulting architecture improves detection on COCO and several other datasets and extends directly to instance segmentation.

Core claim

A multi-stage object detection architecture composed of a sequence of detectors trained with increasing IoU thresholds addresses the overfitting and quality-mismatch problems that arise at high IoU. Detectors are trained sequentially by using the output hypotheses of one stage as the training set for the next, which progressively improves hypothesis quality, guarantees a positive training set of equivalent size for all detectors, and minimizes overfitting. The same cascade is applied at inference to eliminate quality mismatches between hypotheses and detectors.

What carries the argument

The cascade of detectors, where each stage is trained on resampled outputs from the previous stage using a higher IoU threshold, which both refines quality and preserves training-set balance.
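
The mechanism can be illustrated with a toy sketch. This is not the authors' implementation: `refine` is a hypothetical stand-in for a learned box regressor (it simply moves each box halfway toward its ground truth), the proposals are synthetic, and the thresholds 0.5, 0.6, 0.7 are the commonly used values. The sketch only shows why feeding each stage the refined outputs of the previous one can keep the positive count from collapsing as the threshold rises.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def refine(box: Box, gt: Box, step: float = 0.5) -> Box:
    """Hypothetical regressor: move `box` a fraction `step` toward `gt`."""
    x1, y1, x2, y2 = (b + step * (g - b) for b, g in zip(box, gt))
    return (x1, y1, x2, y2)


def run_cascade(proposals: List[Box], gt: Box, thresholds: List[float]) -> List[int]:
    """Return the number of positives (IoU >= threshold) seen by each stage."""
    boxes = list(proposals)
    positives_per_stage = []
    for u in thresholds:
        positives_per_stage.append(sum(iou(b, gt) >= u for b in boxes))
        # Resampling: this stage's regressed boxes become the next stage's inputs.
        boxes = [refine(b, gt) for b in boxes]
    return positives_per_stage


# Synthetic example: one ground-truth box and four horizontally shifted proposals.
gt = (0.0, 0.0, 10.0, 10.0)
proposals = [(dx, 0.0, 10.0 + dx, 10.0) for dx in (1.0, 2.0, 3.0, 4.0)]

cascade_counts = run_cascade(proposals, gt, [0.5, 0.6, 0.7])
single_stage_count = sum(iou(p, gt) >= 0.7 for p in proposals)
```

In this toy setup the three stages see 3, 4, and 4 positives, whereas applying the 0.7 threshold directly to the raw proposals leaves only 1. A real regressor is imperfect, so the actual counts depend on how much each stage improves hypothesis quality.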

If this is right

An implementation without extra components reaches state-of-the-art performance on the COCO dataset.
High-quality detection improves substantially on VOC, KITTI, CityPerson, and WiderFace.
The same cascade structure yields nontrivial gains when applied to instance segmentation over Mask R-CNN.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The resampling mechanism may reduce reliance on heavy post-processing or ensemble methods for refining box quality.
Because each stage operates on a cleaner set of hypotheses, the cascade could be inserted into other two-stage or multi-stage detectors with minimal architectural change.
The approach suggests that quality alignment between training and inference is a general lever worth testing in related tasks such as keypoint detection or 3D bounding-box regression.

Load-bearing premise

Resampling the output hypotheses of one detector as training input for the next stage will guarantee a positive training set of equivalent size for all detectors while progressively improving hypothesis quality without introducing new training instabilities or selection biases.

What would settle it

A controlled experiment in which adding further cascade stages produces no gain or a drop in average precision at IoU thresholds of 0.75 and above on a fixed validation set would falsify the central claim.

Figures

Figures reproduced from arXiv: 1906.09756 by Nuno Vasconcelos, Zhaowei Cai.

**Figure 2.** Bounding box localization, classification loss and d…

**Figure 3.** The architectures of different frameworks. “I” is in…

**Figure 4.** IoU histograms of training samples of each cascade st…

**Figure 5.** Distribution of the distance vector ∆ of (4) (without normalization) at different cascade stages. Top: plot of (δx, δy). Bottom: plot of (δw, δh). Red dots are outliers for the increasing IoU thresholds of later stages, and the statistics shown are obtained after outlier removal.

**Figure 6.** Architectures of the Mask R-CNN (a) and three Cascade…

**Figure 7.** (a) detection performance of individually trained d…

**Figure 8.** Detection performance of all Cascade R-CNN detector…
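
For readers without the paper at hand, the distance vector ∆ = (δx, δy, δw, δh) in the Figure 5 caption is, in the standard R-CNN box parameterization, the regression target between a hypothesis box b and its ground-truth box g. The form below assumes that the caption's equation (4) is this standard parameterization, with boxes given by center coordinates and width/height:

```latex
\delta_x = \frac{g_x - b_x}{b_w}, \qquad
\delta_y = \frac{g_y - b_y}{b_h}, \qquad
\delta_w = \log\frac{g_w}{b_w}, \qquad
\delta_h = \log\frac{g_h}{b_h}
```

As hypotheses get closer to the ground truth in later stages, all four components shrink toward zero, which is consistent with the caption's point that the distribution changes from stage to stage.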

Original abstract

In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its *quality*. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
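
The labeling rule described in the abstract's opening sentences can be written compactly. The notation is assumed here for illustration: x is a hypothesis box, g a ground-truth box with class label g_y, u the IoU threshold, and y the label assigned to x (0 meaning background).

```latex
\mathrm{IoU}(x, g) = \frac{|x \cap g|}{|x \cup g|},
\qquad
y =
\begin{cases}
g_y, & \mathrm{IoU}(x, g) \ge u \\
0,   & \text{otherwise}
\end{cases}
```

Raising u makes the positive set cleaner but smaller for a fixed set of hypotheses, which is the first cause of the paradox the abstract names.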

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Cascade R-CNN, a multi-stage object detection architecture consisting of a sequence of detectors trained with increasing IoU thresholds. The detectors are trained sequentially, with the output of each serving as the training set for the next, so that hypothesis quality improves progressively. This is claimed to guarantee positive training sets of equivalent size and to minimize overfitting, while the same cascade is used at inference to match detector quality to hypothesis quality. The method achieves state-of-the-art performance on COCO, improves high-quality detection on VOC, KITTI, CityPerson, and WiderFace, and generalizes to instance segmentation with gains over Mask R-CNN.

Significance. If the results hold, this offers a practical solution to the high-quality detection problem in object detection by addressing both overfitting from vanishing positives and inference mismatch. The release of two implementations (Caffe and Detectron) is a strength for reproducibility. The approach is general and shows nontrivial improvements on multiple datasets.

major comments (1)

[Abstract] The assertion that sequential resampling guarantees 'a positive training set of equivalent size for all detectors' while 'minimizing overfitting' is not self-evident, because a hypothesis still counts as positive only if its IoU with ground truth exceeds the stage threshold. If the hypothesis quality shift from the prior stage is insufficient, positives may still vanish at higher thresholds, and selection bias toward hypotheses that the prior detector classified well could be introduced. This is load-bearing for the central claim of progressive improvement without new instabilities, and requires explicit verification (e.g., positive sample statistics per stage) in the experimental section.

minor comments (1)

[Abstract] The phrase 'without bells or whistles' is informal; consider rephrasing for a formal journal.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the constructive comment on the abstract claim. We address the concern point-by-point below and agree that adding explicit verification strengthens the paper.

Referee: [Abstract] The assertion that sequential resampling guarantees 'a positive training set of equivalent size for all detectors' while 'minimizing overfitting' is not self-evident, because a hypothesis still counts as positive only if its IoU with ground truth exceeds the stage threshold. If the hypothesis quality shift from the prior stage is insufficient, positives may still vanish at higher thresholds, and selection bias toward hypotheses that the prior detector classified well could be introduced. This is load-bearing for the central claim of progressive improvement without new instabilities, and requires explicit verification (e.g., positive sample statistics per stage) in the experimental section.

Authors: We agree that the positive count is formally defined by the IoU threshold with ground truth and that the guarantee is not automatic. The paper's argument rests on the empirical observation that each stage's output hypotheses have measurably higher quality (higher average IoU with ground truth), which in practice preserves a comparable number of positives when the threshold is raised. We acknowledge that selection bias could in principle arise and that this should be verified rather than asserted. In the revised manuscript we will add a table or plot of positive-sample counts (and average IoU of positives) per stage on COCO to make the claim explicit and to allow readers to assess whether the quality shift is sufficient.

Revision: yes
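
The per-stage statistics discussed in this exchange could be tabulated along the following lines. This is a minimal sketch under stated assumptions: `positive_stats` is a hypothetical helper, and its input is assumed to be, for each stage, the list of IoUs between that stage's input hypotheses and their matched ground-truth boxes. How those IoUs are collected depends on the detector codebase and is not shown.

```python
from statistics import mean
from typing import Dict, List, Sequence


def positive_stats(
    ious_per_stage: Sequence[Sequence[float]],
    thresholds: Sequence[float],
) -> List[Dict[str, float]]:
    """For each stage, count hypotheses with IoU >= threshold and average their IoU."""
    if len(ious_per_stage) != len(thresholds):
        raise ValueError("need exactly one threshold per stage")
    rows = []
    for stage, (ious, u) in enumerate(zip(ious_per_stage, thresholds), start=1):
        positives = [v for v in ious if v >= u]
        rows.append({
            "stage": stage,
            "threshold": u,
            "num_positives": len(positives),
            # NaN marks a stage whose positive set is empty (the failure mode at issue).
            "mean_iou": mean(positives) if positives else float("nan"),
        })
    return rows
```

A roughly constant `num_positives` with a rising `mean_iou` across stages would support the equivalence claim; a sharp drop in `num_positives` at later stages would support the referee's concern.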

Circularity Check

0 steps flagged

No significant circularity; empirical architecture evaluated on external benchmarks

The paper proposes Cascade R-CNN as a multi-stage detector trained sequentially with increasing IoU thresholds, with the central claims resting on empirical performance gains on external public datasets (COCO, VOC, KITTI, etc.) rather than any internal fitted parameters, self-referential definitions, or reductions by construction. The resampling procedure is presented as a design choice whose benefits are asserted and then measured externally; no equations or derivations reduce the output to the input by definition. No load-bearing self-citations appear in the provided text. This matches the default expectation of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method depends on standard supervised deep-learning assumptions and the choice of specific IoU thresholds as hyperparameters; no new physical entities are introduced.

free parameters (1)

IoU thresholds per stage
Specific increasing values (commonly 0.5, 0.6, 0.7) are selected and tuned; these are free parameters that affect training set size and final performance.

axioms (2)

domain assumption: Sequential supervised training on resampled hypotheses from the prior stage produces stable convergence for each detector.
Invoked when describing the training procedure that guarantees equivalent positive sample sizes.
domain assumption: Standard backpropagation and data augmentation suffice for each stage without additional regularization tailored to the cascade.
Underlying assumption for the claim that the cascade minimizes overfitting.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

gen2seg: Generative Models Enable Generalizable Instance Segmentation
cs.CV 2025-05 unverdicted novelty 6.0

Finetuning generative models on limited instance segmentation data produces zero-shot generalization to unseen object categories and styles, matching or exceeding supervised baselines like SAM on ambiguous boundaries.


[1] [1]

Bodla, B

N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms - improving object detection with one line of code. In ICCV, pages 5562–5570, 2017. 3, 10

work page 2017

[2] [2]

Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A uniﬁed mul ti- scale deep convolutional neural network for fast object det ection. In ECCV, pages 354–370, 2016. 3, 4, 5, 8, 12

work page 2016

[3] [3]

Cai and N

Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into hi gh quality object detection. In CVPR, 2018. 3, 8

work page 2018

[4] [4]

X. Cao, Y . Wei, F. Wen, and J. Sun. Face alignment by explici t shape regression. In CVPR, pages 2887–2894, 2012. 5

work page 2012

[5] [5]

K. Chen, J. Pang, J. Wang, Y . Xiong, X. Li, S. Sun, W. Feng, Z. L iu, J. Shi, W. Ouyang, C. C. Loy , and D. Lin. Hybrid task cascade for instance segmentation. CoRR, abs/1901.07518, 2019. 3

work page internal anchor Pith review Pith/arXiv arXiv 1901

[6] [6]

Cortes and V

C. Cortes and V . Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. 2

work page 1995

[7] [7]

J. Dai, K. He, and J. Sun. Instance-aware semantic segment ation via multi-task network cascades. In CVPR, pages 3150–3158, 2016. 3, 7 14

work page 2016

[8] [8]

J. Dai, Y . Li, K. He, and J. Sun. R-FCN: object detection via region- based fully convolutional networks. In NIPS, pages 379–387, 2016. 3, 4, 8, 11

work page 2016

[9] [9]

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. In ICCV, 2017. 10, 11

work page 2017

[10] [10]

Doll´ ar, P

P . Doll´ ar, P . Welinder, and P . Perona. Cascaded pose regression. In CVPR, pages 1078–1085, 2010. 5

work page 2010

[11] [11]

Doll´ ar, C

P . Doll´ ar, C. Wojek, B. Schiele, and P . Perona. Pedestrian detection: An evaluation of the state of the art. IEEE T rans. Pattern Anal. Mach. Intell., 34(4):743–761, 2012. 5

work page 2012

[12] [12]

C. Elkan. The foundations of cost-sensitive learning. In IJCAI, pages 973–978, 2001. 1

work page 2001

[13] [13]

Everingham, L

M. Everingham, L. J. V . Gool, C. K. I. Williams, J. M. Winn , and A. Zisserman. The pascal visual object classes (VOC) challe nge. International Journal of Computer Vision , 88(2):303–338, 2010. 3, 5, 8, 12

work page 2010

[14] [14]

P . F. Felzenszwalb, R. B. Girshick, D. A. McAllester, an d D. Ra- manan. Object detection with discriminatively trained par t-based models. IEEE T rans. Pattern Anal. Mach. Intell. , 32(9):1627–1645,

work page

[15] [15]

Freund and R

Y . Freund and R. E. Schapire. A decision-theoretic gener alization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995. 2

work page 1995

[16] [16]

Geiger, P

A. Geiger, P . Lenz, and R. Urtasun. Are we ready for auton omous driving? the KITTI vision benchmark suite. In CVPR, pages 3354– 3361, 2012. 3, 5, 8, 12

work page 2012

[17] [17]

Gidaris and N

S. Gidaris and N. Komodakis. Object detection via a multi -region and semantic segmentation-aware CNN model. In ICCV, pages 1134–1142, 2015. 3, 6

work page 2015

[18] [18]

Gidaris and N

S. Gidaris and N. Komodakis. Attend reﬁne repeat: Active box proposal generation via in-out localization. In BMVC, 2016. 3, 5, 6, 10, 11

work page 2016

[19] [19]

Gidaris and N

S. Gidaris and N. Komodakis. Locnet: Improving localiza tion accuracy for object detection. In CVPR, pages 789–798, 2016. 3, 5

work page 2016

[20] [20]

Girshick, I

R. Girshick, I. Radosavovic, G. Gkioxari, P . Doll´ ar, and K. He. De- tectron. https://github.com/facebookresearch/detectron, 2018. 3, 11, 13

work page 2018

[21] [21]

R. B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015. 1, 3, 4, 8

work page 2015

[22] [22]

R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Ric h feature hi- erarchies for accurate object detection and semantic segme ntation. In CVPR, pages 580–587, 2014. 1, 3

work page 2014

[23] [23]

S. Han, J. Pool, J. Tran, and W. J. Dally . Learning both wei ghts and connections for efﬁcient neural network. In NIPS, pages 1135– 1143, 2015. 8

work page 2015

[24] [24]

K. He, R. Girshick, and P . Doll´ ar. Rethinking imagenet pre- training. arXiv preprint arXiv:1811.08883 , 2018. 3

work page internal anchor Pith review Pith/arXiv arXiv 2018

[25] [25]

K. He, G. Gkioxari, P . Doll´ ar, and R. Girshick. Mask r-c nn. In ICCV, 2017. 3, 4, 7, 8, 10, 11

work page 2017

[26] [26]

K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014. 3

work page 2014

[27] [27]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning f or image recognition. In CVPR, pages 770–778, 2016. 4, 6, 8, 11

work page 2016

[28] [28]

H. Hu, J. Gu, Z. Zhang, J. Dai, and Y . Wei. Relation networ ks for object detection. In IEEE CVPR, volume 2, 2018. 3, 10, 11

work page 2018

[29] [29]

Speed/accuracy trade-offs for modern convolutional object detectors

J. Huang, V . Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fat hi, I. Fischer, Z. Wojna, Y . Song, S. Guadarrama, and K. Murphy . Speed/accuracy trade-offs for modern convolutional object detec- tors. CoRR, abs/1611.10012, 2016. 10, 11

work page internal anchor Pith review Pith/arXiv arXiv 2016

[30] [30]

Ioffe and C

S. Ioffe and C. Szegedy . Batch normalization: Accelerati ng deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015. 11

work page 2015

[31] [31]

Y . Jia, E. Shelhamer, J. Donahue, S. Karayev , J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional archite cture for fast feature embedding. In MM, pages 675–678, 2014. 3, 8

work page 2014

[32] [32]

Jiang, R

B. Jiang, R. Luo, J. Mao, T. Xiao, and Y . Jiang. Acquisiti on of localization conﬁdence for accurate object detection. In ECCV, pages 816–832, 2018. 3

work page 2018

[33] [33]

Law and J

H. Law and J. Deng. Cornernet: Detecting objects as pair ed keypoints. In ECCV, pages 765–781, 2018. 3, 10, 11

work page 2018

[34] [34]

H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolution al neural network cascade for face detection. In CVPR, pages 5325– 5334, 2015. 3

work page 2015

[35] [35]

Z. Li, C. Peng, G. Y u, X. Zhang, Y . Deng, and J. Sun. Detnet: Design backbone for object detection. In ECCV, pages 339–354, 2018. 10, 11

work page 2018

[36] [36]

T. Lin, M. Maire, S. J. Belongie, J. Hays, P . Perona, D. Ram anan, P . Doll´ ar, and C. L. Zitnick. Microsoft COCO: common object s in context. In ECCV, pages 740–755, 2014. 2, 5, 8

work page 2014

[37] [37]

T.-Y . Lin, P . Doll´ ar, R. Girshick, K. He, B. Hariharan, and S. Be- longie. Feature pyramid networks for object detection. In CVPR,

work page

[38] [38]

1, 3, 4, 5, 8, 10, 11

work page

[39] [39]

T.-Y . Lin, P . Goyal, R. Girshick, K. He, and P . Doll´ ar. Focal loss for dense object detection. In ICCV, 2017. 3, 10, 11

work page 2017

[40] [40]

S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation ne twork for instance segmentation. In IEEE CVPR, pages 8759–8768, 2018. 3, 7

work page 2018

[41] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, pages 21–37, 2016.

[42] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In ECCV, pages 643–659, 2018.

[43] H. Masnadi-Shirazi and N. Vasconcelos. Cost-sensitive boosting. IEEE Trans. Pattern Anal. Mach. Intell., 33(2):294–309, 2011.

[44] M. Najibi, M. Rastegari, and L. S. Davis. G-CNN: an iterative grid based object detector. In CVPR, pages 2369–2377, 2016.

[45] W. Ouyang, K. Wang, X. Zhu, and X. Wang. Learning chained deep features and classifiers for cascade in object detection. CoRR, abs/1702.07054, 2017.

[46] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. In IEEE CVPR, pages 6181–6189, 2018.

[47] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.

[48] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[50] M. J. Saberian and N. Vasconcelos. Learning optimal embedded cascades. IEEE Trans. Pattern Anal. Mach. Intell., 34(10):2005–2018,

[51] A. Shrivastava, A. Gupta, and R. B. Girshick. Training region-based object detectors with online hard example mining. In CVPR, pages 761–769, 2016.

[52] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556,

[53] B. Singh and L. S. Davis. An analysis of scale invariance in object detection–snip. In IEEE CVPR, pages 3578–3587, 2018.

[54] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.

[55] P. A. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[56] X. Wu, D. Zhang, J. Zhu, and S. C. H. Hoi. Single-shot bidirectional pyramid networks for high-quality object detection. CoRR, abs/1803.08208, 2018.

[57] Y. Wu and K. He. Group normalization. In ECCV, pages 3–19,

[58] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.

[59] X. Xiong and F. D. la Torre. Supervised descent method and its applications to face alignment. In CVPR, pages 532–539, 2013.

[60] J. Yan, Z. Lei, D. Yi, and S. Li. Learn to combine multiple hypotheses for accurate face alignment. In ICCV Workshops, pages 392–396, 2013.

[61] B. Yang, J. Yan, Z. Lei, and S. Z. Li. CRAFT objects from images. In CVPR, pages 6043–6051, 2016.

[62] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In CVPR, pages 5525–5533, 2016.

[63] D. Yoo, S. Park, J. Lee, A. S. Paek, and I. Kweon. Attentionnet: Aggregating weak directions for accurate object detection. In ICCV, pages 2659–2667, 2015.

[64] S. Zagoruyko, A. Lerer, T. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC,

[65] S. Zhang, R. Benenson, and B. Schiele. Citypersons: A diverse dataset for pedestrian detection. In CVPR, pages 4457–4465, 2017.

[66] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In IEEE CVPR,