Rethinking Classification and Localization for Cascade R-CNN
Pith reviewed 2026-05-24 14:46 UTC · model grok-4.3
The pith
A simple feature sharing mechanism added to every stage of Cascade R-CNN narrows the performance gap on low IoU thresholds and reaches 43.2 AP on COCO without test ensembles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By embedding a simple feature sharing mechanism into every stage of Cascade R-CNN, the paper claims to narrow the gap between the last stage and the preceding stages at low IoU thresholds using the network alone, without a test-time ensemble. It also reports improvements at all IoU thresholds, with the resulting cascade matching or exceeding its counterparts at negligible extra parameter cost, and demonstrates 43.2 AP on COCO.
What carries the argument
A feature sharing mechanism embedded in every cascade stage, intended to reduce the performance disparity between stages across IoU thresholds.
If this is right
- Earlier stages match the last stage more closely on low IoU thresholds using only the network.
- Accuracy rises across all IoU thresholds.
- The cascade matches or exceeds other detectors with almost no added parameters.
- 43.2 AP is reached on COCO without any testing ensemble or extra tricks.
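To make the notion of "low" and "high" IoU thresholds concrete, here is a minimal sketch in Python of how a predicted box is judged against a ground-truth box. The helper names `iou` and `is_match` and the box layout `(x1, y1, x2, y2)` are assumptions made for illustration; they are not taken from the paper. COCO-style AP averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05, so a detection that counts as correct at a low threshold can still be rejected at a higher one.

```python
from typing import Sequence

# Assumed box layout for this sketch: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, gt: Box, threshold: float) -> bool:
    """A prediction counts as a hit only if its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    pred = (2.0, 0.0, 12.0, 10.0)  # same size, shifted 2 units to the right
    print(iou(pred, gt))             # overlap 80, union 120
    print(is_match(pred, gt, 0.5))   # accepted at a low threshold
    print(is_match(pred, gt, 0.75))  # rejected at a higher threshold
```

The same box is a true positive at IoU 0.5 and a miss at IoU 0.75, which is why a stage tuned for tight localization can gain at high thresholds while losing at low ones.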
Where Pith is reading between the lines
- The same sharing pattern could be tested on other multi-stage detectors to see if ensemble dependence drops in general.
- Lower need for test-time methods might simplify real-time deployment pipelines.
- The small parameter cost suggests the idea could scale to deeper or wider cascades without much overhead.
- Feature reuse across stages may interact with other efficiency techniques such as pruning or quantization.
Load-bearing premise
The observed gains come from the feature-sharing links themselves rather than from unstated changes in training schedule, data augmentation, or hyper-parameter tuning.
What would settle it
Retrain the original Cascade R-CNN under the exact same schedule and augmentation used for the new model but without the feature sharing links, then measure whether the low-IoU gap between stages remains the same size.
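The comparison described above can be sketched as a small Python helper. It assumes that per-stage AP values at each IoU threshold have already been measured for both runs; the data layout (`StageAP`) and the function names `low_iou_gap` and `gap_shrinks` are hypothetical, and nothing here reproduces numbers from the paper.

```python
from typing import Dict, List

# Assumed layout: ap[stage_index][iou_threshold] is the AP of that stage's
# output at that threshold, with the last list entry being the final stage.
StageAP = List[Dict[float, float]]


def low_iou_gap(ap: StageAP, threshold: float = 0.5) -> List[float]:
    """Signed gap at `threshold` between each earlier stage and the last stage.

    A positive value means the earlier stage beats the last stage.
    """
    last = ap[-1][threshold]
    return [stage[threshold] - last for stage in ap[:-1]]


def gap_shrinks(baseline: StageAP, shared: StageAP, threshold: float = 0.5) -> bool:
    """True if the largest absolute stage gap is smaller with feature sharing."""
    base = max(abs(g) for g in low_iou_gap(baseline, threshold))
    new = max(abs(g) for g in low_iou_gap(shared, threshold))
    return new < base
```

If `gap_shrinks` returns `False` when both runs use the same schedule and augmentation, the gap reduction cannot be credited to the feature sharing links.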
Original abstract
We extend the state-of-the-art Cascade R-CNN with a simple feature sharing mechanism. Our approach focuses on the performance increases on high IoU but decreases on low IoU thresholds--a key problem this detector suffers from. Feature sharing is extremely helpful, our results show that given this mechanism embedded into all stages, we can easily narrow the gap between the last stage and preceding stages on low IoU thresholds without resorting to the commonly used testing ensemble but the network itself. We also observe obvious improvements on all IoU thresholds benefited from feature sharing, and the resulting cascade structure can easily match or exceed its counterparts, only with negligible extra parameters introduced. To push the envelope, we demonstrate 43.2 AP on COCO object detection without any bells and whistles including testing ensemble, surpassing previous Cascade R-CNN by a large margin. Our framework is easy to implement and we hope it can serve as a general and strong baseline for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends Cascade R-CNN by embedding a feature-sharing mechanism across all stages. It claims this narrows the performance gap between the final stage and earlier stages on low-IoU thresholds, yields consistent gains across all IoU thresholds, introduces negligible extra parameters, and reaches 43.2 AP on COCO without testing ensembles or other bells and whistles, surpassing prior Cascade R-CNN results.
Significance. If the reported gains can be isolated to the feature-sharing links under controlled training conditions, the work would supply a lightweight, easily implemented improvement to cascade detectors that could serve as a reproducible baseline for subsequent object-detection research.
Major comments (2)
- [Abstract] Abstract and results: the central attribution of the 43.2 AP improvement and the low-IoU gap reduction to the feature-sharing mechanism requires an explicit statement that the baseline Cascade R-CNN was reproduced with identical training schedule, data augmentation, optimizer, and hyper-parameters; no such statement or controlled comparison is supplied, leaving the contribution unverified.
- [Abstract] Abstract: the manuscript reports empirical gains but supplies neither ablation tables isolating the effect of feature sharing on low-IoU thresholds nor training details or error analysis, rendering the key claim that the network itself (rather than external factors) closes the stage gap unverifiable from the given text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing that additional clarifications and experiments are needed to strengthen verifiability.
- Referee: [Abstract] Abstract and results: the central attribution of the 43.2 AP improvement and the low-IoU gap reduction to the feature-sharing mechanism requires an explicit statement that the baseline Cascade R-CNN was reproduced with identical training schedule, data augmentation, optimizer, and hyper-parameters; no such statement or controlled comparison is supplied, leaving the contribution unverified.
  Authors: We agree that an explicit statement confirming controlled reproduction of the baseline is required to isolate the contribution of feature sharing. In the revised manuscript we will add this statement to the abstract and results sections, confirming that the baseline Cascade R-CNN was reproduced with identical training schedule, data augmentation, optimizer, and hyper-parameters.
  Revision: yes
- Referee: [Abstract] Abstract: the manuscript reports empirical gains but supplies neither ablation tables isolating the effect of feature sharing on low-IoU thresholds nor training details or error analysis, rendering the key claim that the network itself (rather than external factors) closes the stage gap unverifiable from the given text.
  Authors: We acknowledge that the current manuscript lacks dedicated ablation tables isolating feature sharing on low-IoU thresholds as well as expanded training details and error analysis. We will add these elements (including new ablations and error analysis) to the revised version to make the claims directly verifiable.
  Revision: yes
Circularity Check
No circularity: empirical architecture change with reported metrics
The paper describes an empirical extension of Cascade R-CNN by adding a feature-sharing mechanism across stages and reports COCO AP numbers (including 43.2). No derivation chain, equations, fitted parameters, or first-principles predictions are presented that could reduce to their own inputs by construction. Claims rest on experimental outcomes rather than self-definitional relations, fitted-input predictions, or load-bearing self-citations. Any concerns about confounding variables in training belong to correctness evaluation, not circularity analysis.