Reprojection R-CNN: A Fast and Accurate Object Detector for 360{deg} Images
Pith reviewed 2026-05-24 15:02 UTC · model grok-4.3
The pith
Reprojection R-CNN detects objects more accurately in 360-degree images by reprojecting region proposals from equirectangular to perspective views.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is a two-stage detector called Reprojection R-CNN for 360° images. It first generates coarse region proposals efficiently with a distortion-aware spherical region proposal network on the equirectangular projection, which has a full omnidirectional field-of-view. Then a novel reprojection network refines the proposed regions using the distortion-free perspective projection. Two novel synthetic datasets are introduced for training and evaluation. The detector achieves better mean average precision than prior methods and processes each image in 178 milliseconds.
What carries the argument
The reprojection network that converts and refines region proposals from the equirectangular projection to perspective projections.
Load-bearing premise
The two synthetic datasets sufficiently represent the distortion patterns and labeling challenges found in actual 360-degree images, allowing the reprojection step to improve proposals without adding biases.
What would settle it
A direct comparison of detection accuracy on a collection of real 360-degree images with ground-truth annotations, checking whether the full Reprojection R-CNN exceeds the accuracy of its components without the reprojection network.
Figures
read the original abstract
360{\deg} images are usually represented in either equirectangular projection (ERP) or multiple perspective projections. Different from the flat 2D images, the detection task is challenging for 360{\deg} images due to the distortion of ERP and the inefficiency of perspective projections. However, existing methods mostly focus on one of the above representations instead of both, leading to limited detection performance. Moreover, the lack of appropriate bounding-box annotations as well as the annotated datasets further increases the difficulties of the detection task. In this paper, we present a standard object detection framework for 360{\deg} images. Specifically, we adapt the terminologies of the traditional object detection task to the omnidirectional scenarios, and propose a novel two-stage object detector, i.e., Reprojection R-CNN by combining both ERP and perspective projection. Owing to the omnidirectional field-of-view of ERP, Reprojection R-CNN first generates coarse region proposals efficiently by a distortion-aware spherical region proposal network. Then, it leverages the distortion-free perspective projection and refines the proposed regions by a novel reprojection network. We construct two novel synthetic datasets for training and evaluation. Experiments reveal that Reprojection R-CNN outperforms the previous state-of-the-art methods on the mAP metric. In addition, the proposed detector could run at 178ms per image in the panoramic datasets, which implies its practicability in real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Reprojection R-CNN, a two-stage detector for 360° images that first generates coarse region proposals via a distortion-aware spherical RPN operating on equirectangular projections (ERP) and then refines them using a reprojection network on distortion-free perspective projections. It introduces two novel synthetic datasets to address the lack of annotated 360° data and reports that the method outperforms prior state-of-the-art approaches on mAP while running at 178 ms per image on the panoramic datasets.
Significance. If the results hold, the work supplies a concrete architecture that explicitly combines the wide field-of-view of ERP with the geometric fidelity of perspective views, together with new synthetic training resources; this could serve as a reusable baseline for omnidirectional detection.
major comments (1)
- [Experiments] Experiments section: all reported mAP gains and the 178 ms runtime are obtained exclusively on the two synthetic datasets created by the authors. No results, ablations, or cross-dataset tests appear on real equirectangular photographs, so the central claim that the detector is practical for 360° images rests on an unverified assumption that the synthetic distortion and annotation statistics match real imagery.
minor comments (1)
- Abstract: the claim of outperformance is stated without naming the baselines or reporting the magnitude of the mAP improvement; adding one sentence with these numbers would improve readability.
Simulated Author's Rebuttal
We thank the referee for highlighting the scope of our experimental evaluation. We agree that the absence of results on real equirectangular images is a limitation and will revise the manuscript accordingly while preserving the core technical contributions.
read point-by-point responses
-
Referee: [Experiments] Experiments section: all reported mAP gains and the 178 ms runtime are obtained exclusively on the two synthetic datasets created by the authors. No results, ablations, or cross-dataset tests appear on real equirectangular photographs, so the central claim that the detector is practical for 360° images rests on an unverified assumption that the synthetic distortion and annotation statistics match real imagery.
Authors: We acknowledge that all reported quantitative results, including mAP improvements and the 178 ms runtime, are obtained solely on the two synthetic datasets introduced in the paper. These datasets were constructed precisely because of the documented scarcity of annotated real 360° imagery (see Introduction). The rendering pipeline was designed to reproduce the geometric distortions of ERP images, and the method itself is representation-agnostic. Nevertheless, we agree that the claim of practicality for real-world 360° images would be stronger with real-data validation. In the revision we will (1) add an explicit Limitations subsection discussing the synthetic-to-real domain gap, (2) moderate the language around “practicability in real-world applications” to refer specifically to the panoramic synthetic setting, and (3) include qualitative examples on publicly available unlabeled real panoramas to illustrate generalization behavior. No new quantitative results on real annotated data can be added without additional data collection. revision: partial
- Providing quantitative mAP or runtime numbers on real annotated equirectangular photographs, as no such experiments were performed and no suitable public datasets with bounding-box annotations were available to the authors.
Circularity Check
No significant circularity; architecture and evaluation are self-contained.
full rationale
The paper proposes a novel two-stage detector (distortion-aware spherical RPN followed by reprojection network) that adapts standard detection terminology to omnidirectional images and is validated on two newly constructed synthetic datasets. No equations, parameters, or performance metrics reduce by construction to inputs defined in the authors' prior work; the mAP and runtime claims are direct experimental outputs on the introduced data rather than fitted quantities renamed as predictions. No load-bearing self-citations or uniqueness theorems imported from the same authors appear in the derivation chain. The central result is therefore independent of the patterns that would trigger circularity scores above 2.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Convolutional features remain useful when networks are adapted for spherical distortion in equirectangular images.
Reference graph
Works this paper leans on
-
[1]
J. Ardouin, A. L ´ecuyer, M. Marchal, C. Riant, and E. Marc- hand. Flyviz: a novel display device to provide humans with 360 vision by coupling catadioptric camera with hmd. In Proceedings of the 18th ACM symposium on Virtual reality software and technology, pages 41–44, 2012. 1
work page 2012
-
[2]
Y . Chen, J. Wang, J. Li, C. Lu, Z. Luo, H. Xue, and C. Wang. Lidar-video driving dataset: Learning driving policies effec- tively. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5870–5878, 2018. 1
work page 2018
-
[3]
T. S. Cohen, M. Geiger, J. K¨ohler, and M. Welling. Spherical cnns. In ICLR, 2018. 2, 4, 6, 7
work page 2018
-
[4]
B. Coors, A. Paul Condurache, and A. Geiger. Spherenet: Learning spherical representations for detection and classifi- cation in omnidirectional images. In Proceedings of the Eu- ropean Conference on Computer Vision (ECCV), pages 518– 533, 2018. 2, 3, 4, 6, 7, 8 Figure 7. More results of Rep R-CNN on the three datasets
work page 2018
-
[5]
J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 1(2):3, 2017. 2
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[6]
C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Dani- ilidis. Learning so (3) equivariant representations with spher- ical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–68, 2018. 2, 6, 7
work page 2018
-
[7]
M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) chal- lenge. International journal of computer vision, 88(2):303– 338, 2010. 6
work page 2010
-
[8]
I. Frederick Pearson. Map ProjectionsTheory and Applica- tions. CRC press, 1990. 4
work page 1990
-
[9]
C.-Y . Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017. 2
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [10]
-
[11]
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 580–587,
-
[12]
K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on com- puter vision, pages 2961–2969, 2017. 1, 2
work page 2017
-
[13]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- ing for image recognition. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 770–778, 2016. 2
work page 2016
- [14]
-
[15]
R. Khasanova and P. Frossard. Graph-based classification of omnidirectional images. In IEEE International Conference on Computer Vision Workshops (ICCVW) , pages 860–869,
-
[16]
Y . K. Lee, J. Jeong, J. S. Yun, C. W. June, and K.-J. Yoon. Spherephd: Applying cnns on a spherical polyhedron rep- resentation of 360 degree images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition, 2019. 2
work page 2019
-
[17]
T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 2
work page 2017
-
[18]
T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- manan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Com- mon objects in context. InEuropean conference on computer vision, pages 740–755, 2014. 6
work page 2014
-
[19]
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.- Y . Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision , pages 21–37,
- [20]
- [21]
-
[22]
J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017. 2
work page 2017
-
[23]
YOLOv3: An Incremental Improvement
J. Redmon and A. Farhadi. Yolov3: An incremental improve- ment. arXiv preprint arXiv:1804.02767, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[24]
S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems , pages 91–99, 2015. 1, 2, 5, 6
work page 2015
-
[25]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision , 115(3):211–252,
-
[26]
A. Shrivastava, A. Gupta, and R. Girshick. Training region- based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016. 2
work page 2016
-
[27]
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 4, 5
work page 2015
-
[28]
J. P. Snyder. Flattening the earth: two thousand years of map projections. University of Chicago Press, 1997. 1
work page 1997
- [29]
- [30]
- [31]
-
[32]
J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. Inter- national journal of computer vision , 104(2):154–171, 2013. 2
work page 2013
-
[33]
J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recogniz- ing scene viewpoint using panoramic place representation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2695–2702, 2012. 6
work page 2012
-
[34]
W. Yang, Y . Qian, F. Cricri, L. Fan, and J.-K. Kama- rainen. Object detection in equirectangular panorama. arXiv preprint arXiv:1805.08009, 2018. 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[35]
Q. Zhao, C. Zhu, F. Dai, Y . Ma, G. Jin, and Y . Zhang. Distortion-aware cnns for spherical images. In International Joint Conferences on Artificial Intelligence , pages 1198– 1204, 2018. 2
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.