pith. sign in

arxiv: 1907.11830 · v1 · pith:7GYWZM6Onew · submitted 2019-07-27 · 💻 cs.CV

Reprojection R-CNN: A Fast and Accurate Object Detector for 360{deg} Images

Pith reviewed 2026-05-24 15:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords 360 degree imagesobject detectionequirectangular projectionperspective projectionregion proposalR-CNNsynthetic datasets
0
0 comments X

The pith

Reprojection R-CNN detects objects more accurately in 360-degree images by reprojecting region proposals from equirectangular to perspective views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to create an effective object detector for 360-degree images that overcomes the distortion in equirectangular representations and the limited field of view in perspective ones. It does this by first finding rough object locations across the entire spherical view using a specialized proposal network, then adjusting those locations by converting them into flat perspective images where standard detection works better. This matters because 360-degree images are increasingly used in areas like virtual reality and robotics, yet detection has lagged due to these projection issues and a shortage of training data. The authors also build two new synthetic datasets to support training and testing of such detectors.

Core claim

The central discovery is a two-stage detector called Reprojection R-CNN for 360° images. It first generates coarse region proposals efficiently with a distortion-aware spherical region proposal network on the equirectangular projection, which has a full omnidirectional field-of-view. Then a novel reprojection network refines the proposed regions using the distortion-free perspective projection. Two novel synthetic datasets are introduced for training and evaluation. The detector achieves better mean average precision than prior methods and processes each image in 178 milliseconds.

What carries the argument

The reprojection network that converts and refines region proposals from the equirectangular projection to perspective projections.

Load-bearing premise

The two synthetic datasets sufficiently represent the distortion patterns and labeling challenges found in actual 360-degree images, allowing the reprojection step to improve proposals without adding biases.

What would settle it

A direct comparison of detection accuracy on a collection of real 360-degree images with ground-truth annotations, checking whether the full Reprojection R-CNN exceeds the accuracy of its components without the reprojection network.

Figures

Figures reproduced from arXiv: 1907.11830 by Ansheng You, Jiaying Liu, Kaigui Bian, Pengyu Zhao, Yuanxing Zhang, Yunhai Tong.

Figure 1
Figure 1. Figure 1: The challenges of the 360◦ object detection task. Ob￾jects in ERP suffer from severe distortion as well as discontinuity on the borders. Besides, the object can hardly be recognized with only a few perspective projections. It is also obvious that rectan￾gular bounding box (blue outline) is not appropriate for this task. Despite the challenges, Rep R-CNN is capable of detecting objects in 360◦ images, as sh… view at source ↗
Figure 2
Figure 2. Figure 2: The illustration of the relationship among ERP, sphere, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The architecture of the proposed Reprojection R-CNN: The first stage employs SphVGG backbone to generate coarse proposals; [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Polar angle/mAP curves of Rep R￾CNN and three baselines on COCO-Men. -90 -60 -30 -0 30 60 90 latitude 0 20 40 60 mAP SphRPN(COCO-Men) SphRPN(VOC360) RPN(COCO-Men) RPN(VOC360) [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: More results of Rep R-CNN on the three datasets. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
read the original abstract

360{\deg} images are usually represented in either equirectangular projection (ERP) or multiple perspective projections. Different from the flat 2D images, the detection task is challenging for 360{\deg} images due to the distortion of ERP and the inefficiency of perspective projections. However, existing methods mostly focus on one of the above representations instead of both, leading to limited detection performance. Moreover, the lack of appropriate bounding-box annotations as well as the annotated datasets further increases the difficulties of the detection task. In this paper, we present a standard object detection framework for 360{\deg} images. Specifically, we adapt the terminologies of the traditional object detection task to the omnidirectional scenarios, and propose a novel two-stage object detector, i.e., Reprojection R-CNN by combining both ERP and perspective projection. Owing to the omnidirectional field-of-view of ERP, Reprojection R-CNN first generates coarse region proposals efficiently by a distortion-aware spherical region proposal network. Then, it leverages the distortion-free perspective projection and refines the proposed regions by a novel reprojection network. We construct two novel synthetic datasets for training and evaluation. Experiments reveal that Reprojection R-CNN outperforms the previous state-of-the-art methods on the mAP metric. In addition, the proposed detector could run at 178ms per image in the panoramic datasets, which implies its practicability in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Reprojection R-CNN, a two-stage detector for 360° images that first generates coarse region proposals via a distortion-aware spherical RPN operating on equirectangular projections (ERP) and then refines them using a reprojection network on distortion-free perspective projections. It introduces two novel synthetic datasets to address the lack of annotated 360° data and reports that the method outperforms prior state-of-the-art approaches on mAP while running at 178 ms per image on the panoramic datasets.

Significance. If the results hold, the work supplies a concrete architecture that explicitly combines the wide field-of-view of ERP with the geometric fidelity of perspective views, together with new synthetic training resources; this could serve as a reusable baseline for omnidirectional detection.

major comments (1)
  1. [Experiments] Experiments section: all reported mAP gains and the 178 ms runtime are obtained exclusively on the two synthetic datasets created by the authors. No results, ablations, or cross-dataset tests appear on real equirectangular photographs, so the central claim that the detector is practical for 360° images rests on an unverified assumption that the synthetic distortion and annotation statistics match real imagery.
minor comments (1)
  1. Abstract: the claim of outperformance is stated without naming the baselines or reporting the magnitude of the mAP improvement; adding one sentence with these numbers would improve readability.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for highlighting the scope of our experimental evaluation. We agree that the absence of results on real equirectangular images is a limitation and will revise the manuscript accordingly while preserving the core technical contributions.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: all reported mAP gains and the 178 ms runtime are obtained exclusively on the two synthetic datasets created by the authors. No results, ablations, or cross-dataset tests appear on real equirectangular photographs, so the central claim that the detector is practical for 360° images rests on an unverified assumption that the synthetic distortion and annotation statistics match real imagery.

    Authors: We acknowledge that all reported quantitative results, including mAP improvements and the 178 ms runtime, are obtained solely on the two synthetic datasets introduced in the paper. These datasets were constructed precisely because of the documented scarcity of annotated real 360° imagery (see Introduction). The rendering pipeline was designed to reproduce the geometric distortions of ERP images, and the method itself is representation-agnostic. Nevertheless, we agree that the claim of practicality for real-world 360° images would be stronger with real-data validation. In the revision we will (1) add an explicit Limitations subsection discussing the synthetic-to-real domain gap, (2) moderate the language around “practicability in real-world applications” to refer specifically to the panoramic synthetic setting, and (3) include qualitative examples on publicly available unlabeled real panoramas to illustrate generalization behavior. No new quantitative results on real annotated data can be added without additional data collection. revision: partial

standing simulated objections not resolved
  • Providing quantitative mAP or runtime numbers on real annotated equirectangular photographs, as no such experiments were performed and no suitable public datasets with bounding-box annotations were available to the authors.

Circularity Check

0 steps flagged

No significant circularity; architecture and evaluation are self-contained.

full rationale

The paper proposes a novel two-stage detector (distortion-aware spherical RPN followed by reprojection network) that adapts standard detection terminology to omnidirectional images and is validated on two newly constructed synthetic datasets. No equations, parameters, or performance metrics reduce by construction to inputs defined in the authors' prior work; the mAP and runtime claims are direct experimental outputs on the introduced data rather than fitted quantities renamed as predictions. No load-bearing self-citations or uniqueness theorems imported from the same authors appear in the derivation chain. The central result is therefore independent of the patterns that would trigger circularity scores above 2.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of the proposed two-stage architecture and the representativeness of the new synthetic datasets; both are introduced in the paper rather than drawn from prior literature with independent verification.

axioms (1)
  • domain assumption Convolutional features remain useful when networks are adapted for spherical distortion in equirectangular images.
    The distortion-aware RPN and reprojection steps presuppose that standard CNN backbones can be modified to handle projection-specific geometry.

pith-pipeline@v0.9.0 · 5803 in / 1208 out tokens · 36289 ms · 2026-05-24T15:02:54.494097+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 4 internal anchors

  1. [1]

    Ardouin, A

    J. Ardouin, A. L ´ecuyer, M. Marchal, C. Riant, and E. Marc- hand. Flyviz: a novel display device to provide humans with 360 vision by coupling catadioptric camera with hmd. In Proceedings of the 18th ACM symposium on Virtual reality software and technology, pages 41–44, 2012. 1

  2. [2]

    Y . Chen, J. Wang, J. Li, C. Lu, Z. Luo, H. Xue, and C. Wang. Lidar-video driving dataset: Learning driving policies effec- tively. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5870–5878, 2018. 1

  3. [3]

    T. S. Cohen, M. Geiger, J. K¨ohler, and M. Welling. Spherical cnns. In ICLR, 2018. 2, 4, 6, 7

  4. [4]

    Coors, A

    B. Coors, A. Paul Condurache, and A. Geiger. Spherenet: Learning spherical representations for detection and classifi- cation in omnidirectional images. In Proceedings of the Eu- ropean Conference on Computer Vision (ECCV), pages 518– 533, 2018. 2, 3, 4, 6, 7, 8 Figure 7. More results of Rep R-CNN on the three datasets

  5. [5]

    J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 1(2):3, 2017. 2

  6. [6]

    Esteves, C

    C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Dani- ilidis. Learning so (3) equivariant representations with spher- ical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–68, 2018. 2, 6, 7

  7. [7]

    Everingham, L

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) chal- lenge. International journal of computer vision, 88(2):303– 338, 2010. 6

  8. [8]

    Frederick Pearson

    I. Frederick Pearson. Map ProjectionsTheory and Applica- tions. CRC press, 1990. 4

  9. [9]

    C.-Y . Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017. 2

  10. [10]

    Girshick

    R. Girshick. Fast r-cnn. In Proceedings of the IEEE inter- national conference on computer vision , pages 1440–1448,

  11. [11]

    Girshick, J

    R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 580–587,

  12. [12]

    K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on com- puter vision, pages 2961–2969, 2017. 1, 2

  13. [13]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- ing for image recognition. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 770–778, 2016. 2

  14. [14]

    Huang, Z

    J. Huang, Z. Chen, D. Ceylan, and H. Jin. 6-dof vr videos with a single 360-camera. In2017 IEEE Virtual Reality (VR), pages 37–44, 2017. 1

  15. [15]

    Khasanova and P

    R. Khasanova and P. Frossard. Graph-based classification of omnidirectional images. In IEEE International Conference on Computer Vision Workshops (ICCVW) , pages 860–869,

  16. [16]

    Y . K. Lee, J. Jeong, J. S. Yun, C. W. June, and K.-J. Yoon. Spherephd: Applying cnns on a spherical polyhedron rep- resentation of 360 degree images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition, 2019. 2

  17. [17]

    T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 2

  18. [18]

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra- manan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Com- mon objects in context. InEuropean conference on computer vision, pages 740–755, 2014. 6

  19. [19]

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.- Y . Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision , pages 21–37,

  20. [20]

    Qiu and A

    W. Qiu and A. Yuille. Unrealcv: Connecting computer vi- sion to unreal engine. In European Conference on Computer Vision, pages 909–916, 2016. 1

  21. [21]

    Redmon, S

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Pro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 779–788, 2016. 2

  22. [22]

    Redmon and A

    J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017. 2

  23. [23]

    YOLOv3: An Incremental Improvement

    J. Redmon and A. Farhadi. Yolov3: An incremental improve- ment. arXiv preprint arXiv:1804.02767, 2018. 2

  24. [24]

    S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems , pages 91–99, 2015. 1, 2, 5, 6

  25. [25]

    Russakovsky, J

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision , 115(3):211–252,

  26. [26]

    Shrivastava, A

    A. Shrivastava, A. Gupta, and R. Girshick. Training region- based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016. 2

  27. [27]

    Simonyan and A

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 4, 5

  28. [28]

    J. P. Snyder. Flattening the earth: two thousand years of map projections. University of Chicago Press, 1997. 1

  29. [29]

    Su and K

    Y .-C. Su and K. Grauman. Learning spherical convolution for fast features from 360 imagery. In Advances in Neural Information Processing Systems, pages 529–539, 2017. 2, 3, 4, 6, 7

  30. [30]

    Su and K

    Y .-C. Su and K. Grauman. Kernel transformer networks for compact spherical convolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 4

  31. [31]

    Tateno, N

    K. Tateno, N. Navab, and F. Tombari. Distortion-aware con- volutional filters for dense prediction in panoramic images. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 707–722, 2018. 2

  32. [32]

    J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. Inter- national journal of computer vision , 104(2):154–171, 2013. 2

  33. [33]

    J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recogniz- ing scene viewpoint using panoramic place representation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2695–2702, 2012. 6

  34. [34]

    W. Yang, Y . Qian, F. Cricri, L. Fan, and J.-K. Kama- rainen. Object detection in equirectangular panorama. arXiv preprint arXiv:1805.08009, 2018. 2, 3, 4

  35. [35]

    Q. Zhao, C. Zhu, F. Dai, Y . Ma, G. Jin, and Y . Zhang. Distortion-aware cnns for spherical images. In International Joint Conferences on Artificial Intelligence , pages 1198– 1204, 2018. 2