Rethinking Classification and Localization for Cascade R-CNN

Ang Li; Chongyang Zhang; Xue Yang

arxiv: 1907.11914 · v1 · pith:EIDNVOJVnew · submitted 2019-07-27 · 💻 cs.CV

Pith reviewed 2026-05-24 14:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords: Cascade R-CNN · feature sharing · object detection · IoU thresholds · COCO dataset · multi-stage detectors · average precision

The pith

A simple feature sharing mechanism added to every stage of Cascade R-CNN narrows the performance gap on low IoU thresholds and reaches 43.2 AP on COCO without test ensembles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends Cascade R-CNN by embedding a feature sharing mechanism across all stages. This targets the drop in performance that earlier stages show on low IoU thresholds compared to the final stage. The shared features let the network itself close that gap instead of relying on test-time ensembles. The change also lifts accuracy at every IoU level while adding almost no parameters, producing a new high of 43.2 AP on COCO object detection.

Core claim

By embedding a simple feature sharing mechanism into all stages of Cascade R-CNN, the gap between the last stage and the preceding stages on low IoU thresholds can be narrowed by the network itself, without resorting to a testing ensemble. The mechanism also brings clear improvements at all IoU thresholds, and the resulting cascade matches or exceeds its counterparts with negligible extra parameters, as demonstrated by 43.2 AP on COCO.

What carries the argument

The feature sharing mechanism embedded into all stages, which shares features to reduce performance disparity across IoU thresholds.
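
The page does not spell out the sharing operation itself. The Figure 2 caption only states that classification feature sharing (CFS) runs in parallel and localization feature sharing (LFS) runs serially. The sketch below illustrates that distinction on toy vectors. The element-wise averaging, the direction of sharing, and the names `share_parallel` and `share_serial` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors (assumed fusion op)."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def share_parallel(stage_feats: List[Vector]) -> List[Vector]:
    """Parallel sharing: stage k fuses the raw features of stages 1..k at once."""
    return [mean_vectors(stage_feats[: k + 1]) for k in range(len(stage_feats))]


def share_serial(stage_feats: List[Vector]) -> List[Vector]:
    """Serial sharing: stage k fuses its features with stage k-1's fused output."""
    shared = [stage_feats[0]]
    for feat in stage_feats[1:]:
        shared.append(mean_vectors([shared[-1], feat]))
    return shared


# Toy per-stage feature vectors (hypothetical values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cls_shared = share_parallel(feats)  # CFS-style, parallel
loc_shared = share_serial(feats)    # LFS-style, serial
```

With these toy inputs the two schemes diverge at the third stage: the parallel variant weights all three stages equally, while the serial variant gives the most recent stage half of the weight.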

If this is right

Earlier stages match the last stage more closely on low IoU thresholds using only the network.
Accuracy rises across all IoU thresholds.
The cascade matches or exceeds other detectors with almost no added parameters.
43.2 AP is reached on COCO without any testing ensemble or extra tricks.
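
The IoU thresholds these points refer to can be made concrete. Below is a minimal sketch of box IoU and the standard COCO threshold sweep (0.50 to 0.95 in steps of 0.05, over which COCO AP is averaged). The `(x1, y1, x2, y2)` box format and the helper names are assumptions for illustration.

```python
from typing import List, Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO averages AP over these ten IoU thresholds.
COCO_IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou: float) -> List[float]:
    """Thresholds at which a detection with this IoU counts as a match."""
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]


iou = box_iou((0, 0, 10, 10), (0, 0, 10, 6))
passed = thresholds_passed(iou)
```

A detection with IoU 0.6 counts as correct only at the low thresholds, which is the regime where the earlier cascade stages are said to differ most from the last stage.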

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sharing pattern could be tested on other multi-stage detectors to see if ensemble dependence drops in general.
Lower need for test-time methods might simplify real-time deployment pipelines.
The small parameter cost suggests the idea could scale to deeper or wider cascades without much overhead.
Feature reuse across stages may interact with other efficiency techniques such as pruning or quantization.

Load-bearing premise

The observed gains come from the feature-sharing links themselves rather than from unstated changes in training schedule, data augmentation, or hyper-parameter tuning.

What would settle it

Retrain the original Cascade R-CNN under the exact same schedule and augmentation used for the new model but without the feature sharing links, then measure whether the low-IoU gap between stages remains the same size.
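
One way to set up that check is sketched below: two run configurations that are verified to differ only in the sharing flag, plus a per-threshold gap computed as AP(last stage) minus AP(earlier stage). The `RunConfig` fields, the `augmentation` value, the helper names, and the AP numbers are placeholders invented for illustration. They are not settings or results from the paper.

```python
from dataclasses import asdict, dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class RunConfig:
    backbone: str = "ResNet-50"
    schedule: str = "1x"
    augmentation: str = "horizontal-flip"  # hypothetical setting
    feature_sharing: bool = False


def differing_fields(a: RunConfig, b: RunConfig) -> List[str]:
    """Names of the fields on which two configurations disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


def stage_gap(ap_last: Dict[float, float], ap_earlier: Dict[float, float]) -> Dict[float, float]:
    """Per-IoU-threshold AP difference: last stage minus an earlier stage."""
    return {t: round(ap_last[t] - ap_earlier[t], 4) for t in sorted(ap_last)}


baseline = RunConfig()
treated = replace(baseline, feature_sharing=True)
# The comparison is only controlled if nothing else changed.
assert differing_fields(baseline, treated) == ["feature_sharing"]

# Toy AP values keyed by IoU threshold (made up, not measurements).
ap_stage3 = {0.5: 0.60, 0.75: 0.45}
ap_stage2 = {0.5: 0.62, 0.75: 0.44}
gap = stage_gap(ap_stage3, ap_stage2)
```

Running `stage_gap` on both the baseline and the treated run, then comparing the low-threshold entries, would show whether the gap shrinks when only the sharing links change.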

Figures

Figures reproduced from arXiv: 1907.11914 by Ang Li, Chongyang Zhang, Xue Yang.

**Figure 1.** Solid line: the difference of Stage3 minus FPN, and Stage3 minus Stage2, in AP (ΔAP) between IoU 0.5 and 0.95 for the original Cascade R-CNN. Dashed line: the difference of FSStage3 minus FPN, and FSStage3 minus FSStage2, in AP (ΔAP) for our FSCascade. We show the results based on ResNet-50 and the ‘1×’ training strategy. Our approach is closer to zero on low IoU thresholds (lower gap) and mu…

**Figure 2.** Structures of classification feature sharing (CFS) and localization feature sharing (LFS), shown on the left and right, respectively. The CFS is executed in parallel and the LFS serially. We show the sharing structure of the 3rd stage for simplicity.

**Figure 3.** The difference of Stage3 minus FPN, and Stage3 minus Stage2, in AP (ΔAP) for all the experiments based on ResNet-50 with the ‘1×’ training strategy. (a) is the original Cascade R-CNN, (b) is FSCascade with only classification features shared (CFS), (c) is FSCascade with only localization features shared (LFS), and (d) is our FSCascade.

**Figure 4.** The overall performance on the COCO val set of our FSCascade vs. Cascade R-CNN, based on ResNet-50 with the ‘1×’ training strategy. The IoU thresholds between 0.5 and 0.75 are shown to verify: (1) the narrowed gap (blue dashed line vs. blue solid line), (2) the overall improvements (red solid line vs. blue solid line).

**Figure 5.** Confidence scores of the detection boxes whose IoU with ground-truth boxes is between 0.5 and 0.75 on the COCO val set. More balanced scores can be seen in the right image (b), based on FSCascade.

read the original abstract

We extend the state-of-the-art Cascade R-CNN with a simple feature sharing mechanism. Our approach focuses on the performance increases on high IoU but decreases on low IoU thresholds--a key problem this detector suffers from. Feature sharing is extremely helpful, our results show that given this mechanism embedded into all stages, we can easily narrow the gap between the last stage and preceding stages on low IoU thresholds without resorting to the commonly used testing ensemble but the network itself. We also observe obvious improvements on all IoU thresholds benefited from feature sharing, and the resulting cascade structure can easily match or exceed its counterparts, only with negligible extra parameters introduced. To push the envelope, we demonstrate 43.2 AP on COCO object detection without any bells and whistles including testing ensemble, surpassing previous Cascade R-CNN by a large margin. Our framework is easy to implement and we hope it can serve as a general and strong baseline for future research.
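
For context, the "testing ensemble" mentioned in the abstract is, in the original Cascade R-CNN as commonly implemented, an average of the classification scores that every stage's head produces for the same final boxes. A minimal sketch of that averaging follows, with toy scores and a hypothetical helper name `ensemble_scores`.

```python
from typing import List


def ensemble_scores(stage_scores: List[List[float]]) -> List[float]:
    """Average the per-class scores from every stage head for one box."""
    n = len(stage_scores)
    return [sum(per_class) / n for per_class in zip(*stage_scores)]


# Toy scores over [background, object] from three stage heads (made up).
stage_scores = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
averaged = ensemble_scores(stage_scores)  # ensemble inference
last_stage_only = stage_scores[-1]        # inference without the ensemble
```

The abstract's claim is that, with feature sharing, the low-IoU gap can be narrowed without this averaging step at test time.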

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Feature sharing across Cascade R-CNN stages is claimed to deliver 43.2 AP, but the numbers are not isolated from possible training or augmentation differences.

The main takeaway is that this work adds feature sharing to every stage of Cascade R-CNN to reduce the low-IoU performance drop, reaching 43.2 AP on COCO without ensembles. The change is small and adds almost no parameters. They report gains at all IoU thresholds and say the network itself now closes the gap between stages. The method is easy to add on top of existing code. That is the concrete contribution.

The paper does a reasonable job of stating the problem in Cascade R-CNN and showing that a simple link can help in their runs. The numbers are specific enough to be worth checking if you already use this detector.

The soft spot is the missing controls. The description gives no sign that the original Cascade R-CNN baseline was retrained with identical schedule, augmentation, optimizer, or hyperparameters. Any deviation would make it impossible to credit the feature sharing for the lift. No ablation tables or error breakdowns are referenced, so the size of the actual effect remains unclear.

This paper is for people already running Cascade R-CNN in detection pipelines who want a quick tweak to try. A reader who needs a stronger baseline or is doing production tuning might get value from testing the idea. It deserves peer review because the claim is testable with standard code and the result is concrete, even if the current write-up leaves the attribution open.

Referee Report

2 major / 0 minor

Summary. The paper extends Cascade R-CNN by embedding a feature-sharing mechanism across all stages. It claims this narrows the performance gap between the final stage and earlier stages on low-IoU thresholds, yields consistent gains across all IoU thresholds, introduces negligible extra parameters, and reaches 43.2 AP on COCO without testing ensembles or other bells and whistles, surpassing prior Cascade R-CNN results.

Significance. If the reported gains can be isolated to the feature-sharing links under controlled training conditions, the work would supply a lightweight, easily implemented improvement to cascade detectors that could serve as a reproducible baseline for subsequent object-detection research.

major comments (2)

[Abstract] Abstract and results: the central attribution of the 43.2 AP improvement and the low-IoU gap reduction to the feature-sharing mechanism requires an explicit statement that the baseline Cascade R-CNN was reproduced with identical training schedule, data augmentation, optimizer, and hyper-parameters; no such statement or controlled comparison is supplied, leaving the contribution unverified.
[Abstract] Abstract: the manuscript reports empirical gains but supplies neither ablation tables isolating the effect of feature sharing on low-IoU thresholds nor training details or error analysis, rendering the key claim that the network itself (rather than external factors) closes the stage gap unverifiable from the given text.
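
The ablation the second comment asks for can be enumerated mechanically. The sketch below lists the four on/off combinations of CFS and LFS, which mirror the variants named in the Figure 3 caption. The `ablation_grid` helper and the dictionary layout are hypothetical, and no results are attached.

```python
from itertools import product
from typing import Dict, List, Union

# Variant labels follow the Figure 3 caption: (a) original, (b) CFS only,
# (c) LFS only, (d) full FSCascade.
VARIANT_NAMES = {
    (False, False): "Cascade R-CNN (baseline)",
    (True, False): "FSCascade, CFS only",
    (False, True): "FSCascade, LFS only",
    (True, True): "FSCascade (CFS + LFS)",
}


def ablation_grid() -> List[Dict[str, Union[bool, str]]]:
    """Every on/off combination of the two sharing components."""
    return [
        {"cfs": cfs, "lfs": lfs, "name": VARIANT_NAMES[(cfs, lfs)]}
        for cfs, lfs in product([False, True], repeat=2)
    ]


grid = ablation_grid()
```

Each entry would be trained under one fixed schedule and evaluated per IoU threshold, so that any change in the low-IoU gap can be attributed to a specific component.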

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing that additional clarifications and experiments are needed to strengthen verifiability.

Referee: [Abstract] Abstract and results: the central attribution of the 43.2 AP improvement and the low-IoU gap reduction to the feature-sharing mechanism requires an explicit statement that the baseline Cascade R-CNN was reproduced with identical training schedule, data augmentation, optimizer, and hyper-parameters; no such statement or controlled comparison is supplied, leaving the contribution unverified.

Authors: We agree that an explicit statement confirming controlled reproduction of the baseline is required to isolate the contribution of feature sharing. In the revised manuscript we will add this statement to the abstract and results sections, confirming that the baseline Cascade R-CNN was reproduced with identical training schedule, data augmentation, optimizer, and hyper-parameters. revision: yes

Referee: [Abstract] Abstract: the manuscript reports empirical gains but supplies neither ablation tables isolating the effect of feature sharing on low-IoU thresholds nor training details or error analysis, rendering the key claim that the network itself (rather than external factors) closes the stage gap unverifiable from the given text.

Authors: We acknowledge that the current manuscript lacks dedicated ablation tables isolating feature sharing on low-IoU thresholds as well as expanded training details and error analysis. We will add these elements (including new ablations and error analysis) to the revised version to make the claims directly verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture change with reported metrics

The paper describes an empirical extension of Cascade R-CNN by adding a feature-sharing mechanism across stages and reports COCO AP numbers (including 43.2). No derivation chain, equations, fitted parameters, or first-principles predictions are presented that could reduce to their own inputs by construction. Claims rest on experimental outcomes rather than self-definitional relations, fitted-input predictions, or load-bearing self-citations. Any concerns about confounding variables in training belong to correctness evaluation, not circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work relies on standard deep-learning training assumptions and the original Cascade R-CNN design.

