
arxiv: 1907.07477 · v1 · pith:BJIAQAJ3new · submitted 2019-07-17 · 💻 cs.CV

AVDNet: A Small-Sized Vehicle Detection Network for Aerial Visual Data

Pith reviewed 2026-05-24 20:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords small vehicle detection · aerial imagery · one-stage object detection · residual blocks · convolutional neural network · feature preservation · aerial dataset

The pith

AVDNet uses multi-scale residual blocks to detect small vehicles in aerial images with less computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AVDNet, a one-stage detector built to handle the small size, complex backgrounds, and uniform appearance of vehicles seen from above. It inserts ConvRes residual blocks at several scales and enlarges the final feature map so that detail about tiny objects does not vanish in deeper layers. The network is evaluated on VEDAI, DLR-3K, DOTA and a newly annotated ABD set, where it reports higher mean average precision together with lower computation and memory use than prior detectors. A recurrent-feature visualization method is also presented to inspect internal behavior. If the blocks and enlarged map are the main drivers, then lightweight aerial detection becomes practical on platforms with tight power or storage limits.

Core claim

AVDNet is a one-stage vehicle detection network that places ConvRes residual blocks at multiple scales to counteract the loss of features for small objects that occurs with deeper convolutional layers. An enlarged output feature map works with these blocks to maintain robust representations of salient features for small-sized vehicles. The design is shown to raise mean average precision on VEDAI, DLR-3K, DOTA and the combined set that includes the new ABD collection while cutting both computation time and model size relative to existing techniques.
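The effect of an enlarged output feature map can be illustrated with simple arithmetic. The sketch below assumes a square 1024 × 1024 input (the example resolution the paper itself mentions) and two hypothetical downsampling strides, 32 and 16, plus a hypothetical 20-pixel-wide vehicle; none of these stride or object-size values are taken from the paper.

```python
def grid_cells(input_size: int, stride: int) -> int:
    """Side length of the output feature map for a square input."""
    return input_size // stride


def cells_per_object(object_px: float, stride: int) -> float:
    """Approximate number of output cells an object spans along one axis."""
    return object_px / stride


# Hypothetical strides and object width, chosen only for illustration.
for stride in (32, 16):
    side = grid_cells(1024, stride)
    span = cells_per_object(20, stride)
    print(f"stride {stride}: {side}x{side} map, object spans {span:.3f} cells")
```

Under these assumptions, halving the stride doubles the side of the output map and doubles the number of cells a small vehicle covers, which is the intuition behind keeping the final feature map large.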

What carries the argument

ConvRes residual blocks inserted at multiple scales together with an enlarged output feature map that together preserve detail for small objects through the network.
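A minimal NumPy sketch of the skip-addition in a ConvRes block may help. It follows the relation quoted in the extracted text, where the block output is the sum of the responses of layer l and layer l−2 in a three-layer block. The use of 1 × 1 convolutions for all three layers, the leaky-ReLU slope of 0.1, and the names `conv1x1` and `convres_block` are assumptions made for brevity; the paper's actual kernel sizes and channel widths are not given on this page.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    """Leaky ReLU; the slope value is an assumption, not taken from the paper."""
    return np.where(x > 0, x, slope * x)


def conv1x1(x, w):
    """1x1 convolution: x has shape (C, H, W), w has shape (C_out, C)."""
    return leaky_relu(np.einsum("oc,chw->ohw", w, x))


def convres_block(x, w1, w2, w3):
    """Three conv layers with a residual sum of layer l and layer l-2."""
    f_lm2 = conv1x1(x, w1)       # response of layer l-2
    f_lm1 = conv1x1(f_lm2, w2)   # response of layer l-1
    f_l = conv1x1(f_lm1, w3)     # response of layer l
    return f_l + f_lm2           # residual addition, as in the paper's Eq. (3)


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
out = convres_block(
    x,
    rng.normal(size=(4, 3)),
    rng.normal(size=(2, 4)),
    rng.normal(size=(4, 2)),
)
print(out.shape)
```

Because the earlier response is added back unchanged, whatever detail layer l−2 carried about a small object remains present in the block output even if the later layers suppress it.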

If this is right

  • One-stage detectors can maintain small-object performance without adding depth that erases fine detail.
  • Lower computation and space complexity make the detector suitable for onboard aerial platforms.
  • The RFAV visualization technique provides a way to inspect how residual connections affect feature retention in aerial scenes.
  • A new annotated airborne dataset supplies additional examples of small vehicles for training and testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-scale residual pattern could be tested on other small-object tasks such as counting animals or infrastructure in satellite imagery.
  • If the enlarged feature map proves decisive, pairing it with different backbone networks might produce further efficiency gains.
  • Extending the static-image approach to video sequences would test whether motion cues add value beyond the spatial improvements shown.

Load-bearing premise

The reported gains in accuracy and efficiency come from the ConvRes blocks and enlarged feature map rather than from training choices or dataset quirks.

What would settle it

An ablation that removes the ConvRes blocks, keeps the same training protocol and datasets, and measures whether mean average precision drops on VEDAI or DOTA would falsify the claim if the drop is negligible.
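To make the proposed ablation concrete, the comparison reduces to computing mean average precision for the full model and for the variant without ConvRes blocks, then inspecting the difference. The sketch below assumes 11-point interpolated average precision in the PASCAL VOC 2007 style, which matches the paper's description of mAP as the average of the maximum precisions at different recall values; the exact protocol the authors used is not stated on this page, and the function names are hypothetical.

```python
def interpolated_ap(recalls, precisions, num_points=11):
    """Average of the maximum precision at evenly spaced recall thresholds."""
    total = 0.0
    for i in range(num_points):
        threshold = i / (num_points - 1)
        # Best precision among operating points whose recall reaches the threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates, default=0.0)
    return total / num_points


def mean_ap(per_class_ap):
    """Mean of per-class average precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


def ablation_delta(full_model_ap, ablated_model_ap):
    """mAP of the full model minus mAP of the ablated variant."""
    return mean_ap(full_model_ap) - mean_ap(ablated_model_ap)
```

A delta near zero under an identical training protocol would suggest the blocks are not the main driver; a clear positive delta would support the paper's attribution.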

Figures

Figures reproduced from arXiv: 1907.07477 by Manal Shah, Murari Mandal, Prashant Meena, Sanhita Devi, Santosh Kumar Vipparthi.

Figure 2: RFAV visualization of the ConvRes3 block of AVDNet.
Figure 3: Sample activation responses after each ConvRes block of AVDNet. The red boxes highlight the activations in different regions for the presence of vehicles in the input image; d = depth of the activation map.
… location (a, b) is calculated using the following equation: $H_l^{(a,b)}(z) = \sum_{k=1}^{d} \delta\left(F_l^k(a,b) - z\right)$; $z \in [0, 255]$. (5)
Figure 4: Precision-recall graph of the proposed and existing state-of-the-art object detectors over (a) VEDAI and (b) DLR-3K.
Figure 5: Precision-recall graph of the proposed and existing state-of-the-art object detectors over (a) DOTA and (b) Complete data set.
Figure 6: Qualitative performance of AVDNet under various challenging scenarios. First row: occlusion by overhead building/trees. Second row: vehicles of varying sizes and orientations. Third row: vehicles covered with shadows. (a) Input aerial image. (b) Vehicles detected by AVDNet.
… YOLOv2_608x608 in terms of mAP. As stated earlier, the enlarged dimensionality of the final tensor layer used in the proposed AVDNet l…
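The RFAV computation behind Figures 2 and 3 can be sketched from the two equations extracted from the paper: a per-location histogram over the d feature maps of a layer, followed by an arg max over intensity values z in [0, 255]. In effect this takes the most frequent activation value across depth at each pixel. The sketch assumes the feature maps are already quantized to 8-bit integers (how the authors quantize is not stated here) and that ties resolve to the smallest z, which is simply NumPy's `argmax` behavior; the function name `rfav` is hypothetical.

```python
import numpy as np


def rfav(feature_maps):
    """Per-pixel most frequent value across depth.

    feature_maps: integer array of shape (d, H, W) with values in [0, 255].
    Returns an (H, W) uint8 array.
    """
    d, h, w = feature_maps.shape
    hist = np.zeros((256, h, w), dtype=np.int64)
    rows, cols = np.indices((h, w))
    for k in range(d):
        # Each (row, col) pair appears once per map, so += is safe here.
        hist[feature_maps[k], rows, cols] += 1
    # Arg max over z of the per-location histogram.
    return hist.argmax(axis=0).astype(np.uint8)


maps = np.array([[[5, 1]], [[5, 2]], [[9, 2]]], dtype=np.uint8)
print(rfav(maps))
```

The result is a single composite image per layer, which is what makes it usable for inspecting whether small-vehicle responses survive after each ConvRes block.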
Original abstract

Detection of small-sized targets in aerial views is a challenging task due to the smallness of vehicle size, complex background, and monotonic object appearances. In this letter, we propose a one-stage vehicle detection network (AVDNet) to robustly detect small-sized vehicles in aerial scenes. In AVDNet, we introduced ConvRes residual blocks at multiple scales to alleviate the problem of vanishing features for smaller objects caused because of the inclusion of deeper convolutional layers. These residual blocks, along with enlarged output feature map, ensure the robust representation of the salient features for small sized objects. Furthermore, we proposed a recurrent-feature aware visualization (RFAV) technique to analyze the network behavior. We also created a new airborne image data set (ABD) by annotating 1396 new objects in 79 aerial images for our experiments. The effectiveness of AVDNet is validated on VEDAI, DLR- 3K, DOTA, and the combined (VEDAI, DLR-3K, DOTA, and ABD) data set. Experimental results demonstrate the significant performance improvement of the proposed method over state-of-the-art detection techniques in terms of mAP, computation, and space complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes AVDNet, a one-stage detector for small vehicles in aerial imagery. It introduces ConvRes residual blocks at multiple scales plus an enlarged output feature map to mitigate vanishing features, a recurrent-feature aware visualization (RFAV) technique, and a new 79-image ABD dataset. Experiments on VEDAI, DLR-3K, DOTA and the combined set claim superior mAP together with lower computation and model size versus prior detectors.

Significance. If the reported gains can be shown to arise from the ConvRes blocks and enlarged feature map rather than from uncontrolled differences in training or baselines, the work would offer a compact architecture useful for real-time aerial vehicle detection. The new ABD set and RFAV tool are modest additional assets.

major comments (3)
  1. [Experiments] Experiments section: no ablation results are supplied that isolate the contribution of the ConvRes blocks or the enlarged output feature map (e.g., AVDNet minus ConvRes, or minus enlarged map). Without these controlled comparisons the headline claim that the architectural additions drive the mAP/complexity gains cannot be verified.
  2. [Experiments] Experiments section: the manuscript does not state that the cited state-of-the-art baselines were re-trained under identical optimizer, augmentation, learning-rate schedule and loss weighting as AVDNet. Observed deltas could therefore reflect implementation differences rather than the proposed components.
  3. [Dataset] Dataset section: the newly introduced ABD set contains only 79 images. When results are reported on the combined (VEDAI+DLR-3K+DOTA+ABD) collection, the small size of ABD limits the strength of any generalization claim.
minor comments (1)
  1. [Abstract] Abstract: 'DLR- 3K' contains an extraneous space before the numeral.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

Point-by-point responses
  1. Referee: Experiments section: no ablation results are supplied that isolate the contribution of the ConvRes blocks or the enlarged output feature map (e.g., AVDNet minus ConvRes, or minus enlarged map). Without these controlled comparisons the headline claim that the architectural additions drive the mAP/complexity gains cannot be verified.

    Authors: We agree that dedicated ablation studies would more directly isolate the effects of the ConvRes residual blocks and the enlarged output feature map. The manuscript demonstrates overall gains via end-to-end comparisons against published baselines on four datasets, but these do not decompose the individual contributions. We will add controlled ablation experiments (AVDNet variants with and without each component) to the revised Experiments section. revision: yes

  2. Referee: Experiments section: the manuscript does not state that the cited state-of-the-art baselines were re-trained under identical optimizer, augmentation, learning-rate schedule and loss weighting as AVDNet. Observed deltas could therefore reflect implementation differences rather than the proposed components.

    Authors: Baseline numbers are taken from the original publications, as is conventional when full re-implementation details are unavailable. Our own training protocol for AVDNet is described in detail. We recognize that this leaves room for implementation variance and will re-train the primary baselines (e.g., YOLO, SSD variants) under identical settings for the revision to enable a fairer head-to-head comparison. revision: yes

  3. Referee: Dataset section: the newly introduced ABD set contains only 79 images. When results are reported on the combined (VEDAI+DLR-3K+DOTA+ABD) collection, the small size of ABD limits the strength of any generalization claim.

    Authors: ABD (79 images, 1396 objects) was introduced as a modest supplementary set to increase scene diversity rather than as a large-scale benchmark. Individual results on VEDAI, DLR-3K and DOTA are also reported separately. We will revise the text to explicitly frame ABD as an auxiliary validation set and temper any generalization language regarding the combined collection. revision: partial

Circularity Check

0 steps flagged

No circularity: experimental claims rest on direct dataset comparisons without internal reductions

Full rationale

The manuscript proposes AVDNet with ConvRes blocks and an enlarged feature map, then reports mAP and complexity numbers on VEDAI/DLR-3K/DOTA/ABD. No equations, fitted parameters, or predictions appear; the central claim is an empirical performance delta against prior detectors. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the architecture. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim depends on the assumption that residual blocks preserve small-object features in deep CNNs and that the chosen datasets adequately represent real aerial conditions; no independent evidence for these modeling choices is supplied in the abstract.

free parameters (1)
  • network scale factors and block placement
    The number and placement of ConvRes blocks and the enlargement factor of the output feature map are chosen by the authors; exact values not stated in abstract.
axioms (1)
  • domain assumption: Residual connections alleviate vanishing features for small objects in deeper layers
    Invoked to justify the ConvRes design choice.
invented entities (2)
  • ConvRes residual blocks · no independent evidence
    purpose: Preserve salient features of small vehicles across network depth
    Newly named component introduced for this task; no external validation cited.
  • RFAV visualization technique · no independent evidence
    purpose: Analyze network attention via recurrent features
    Newly proposed method; no external validation cited.



Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

  1. [1]

    Convolutional Layer: Let $I^C(a, b)$ be an input image of size M × M; a ∈ [1, M], b ∈ [1, M] having C channels and f(·) is the filter with a kernel size h × h. The response of the convolutional layer (conv) is computed by the following equation: $F^d = \left. \sum_{j=1}^{C} f_k(h) * I_j^n + b_k \right|_{k=1}^{d}$ (1) where $b_k$ is the bias, n ∈ [1, M], and d is the filter depth. ...

  2. [2]

    These ConvRes residual features are studied at three different scales in the AVDNet

    ConvRes Blocks: The response of a ConvRes block consisting of three conv layers is computed using the following equation: $F^d_{\mathrm{ConvRes}} = F^d_l(a, b) + F^d_{l-2}(a, b)$ (3) where l is the current conv feature layer. These ConvRes residual features are studied at three different scales in the AVDNet. The 1 × 1 conv response along with the leaky ReLU introduces an...

  3. [3]

    2, the composite visual representation of the multiple feature maps generated at the end of a conv operation is shown

    Recurrent-Feature Aware Visualization: In Fig. 2, the composite visual representation of the multiple feature maps generated at the end of a conv operation is shown. For d feature maps in conv layer l, the RFAV representation is computed using the following equation: $\mathrm{RFAV}_l(a, b) = \arg\max_z \left( H_l^{(a,b)}(z) \right)$; $z \in [0, 255]$ (4) where arg max (·) collects ...

  4. [4]

    These detailed features are very useful in the detection of small and dense objects

    Feature Degradation Problem: Usually, the initial layers have detailed information as compared to the features at the deeper layers. These detailed features are very useful in the detection of small and dense objects. In order to preserve the small-sized object features, we designed residual feature blocks at multiple scales in the AVDNet. These residual...

  5. [5]

    This is in contrast to the features of bigger objects with higher pixel-per-object values, which are clearly detected in the deeper CNN layers

    Effect of Final Feature Map Resolution: The lower pixels-per-object values of the smaller objects cause the features to vanish in the deeper networks. This is in contrast to the features of bigger objects with higher pixel-per-object values, which are clearly detected in the deeper CNN layers. For example, if the resolution of an image is 1024 × 1024, ...

  6. [6]

    This point was also reiterated by Lin et al

    Higher Pixels-Per-Object Values: We have made another observation that the input layer size influences the network’s capability to learn the features for the small-size objects. This point was also reiterated by Lin et al. [22] in RetinaNet where they used 600-pixel and 800-pixel image scale as input to the network to improve the detection performance. In...

  7. [7]

    In our experiments, we have trained our proposed AVDNet for 11 vehicle categories

    VEDAI: VEDAI [27] data set contains aerial images captured from various scenarios for vehicle detection. In our experiments, we have trained our proposed AVDNet for 11 vehicle categories. The details of all the data sets (number of images, objects per class, etc.) are given in Table I

  8. [8]

    For our experiments, we have divided each image (total of 20 images) into 16 parts to generate 320 images

    DLR-3K: DLR-3K [1] is mainly comprised of scenes from urban and residential areas. For our experiments, we have divided each image (total of 20 images) into 16 parts to generate 320 images. We have manually annotated all the images in DLR-3K and generated 8401 horizontally aligned bounding boxes for all the objects. Finally, we selected 262 images with ...

  9. [9]

    In our experiments, we have represented these objects through four categories: car, heavy vehicle, plane, and boat

    DOTA: DOTA [28] introduced a large-scale data set consisting of 2806 aerial images. In our experiments, we have represented these objects through four categories: car, heavy vehicle, plane, and boat. Moreover, we manually annotated all the images in DOTA and generated 55 235 horizontally aligned bounding boxes as ground truth

  10. [10]

    The objects were annotated with four different classes: car, heavy vehicle, plane, and boat

    Airborne Data Set (ABD) Data Set: We collected 79 new aerial images from online sources and generated a new data set named ABD by annotating 1396 objects for our experiments. The objects were annotated with four different classes: car, heavy vehicle, plane, and boat

  11. [11]

    The complete data set is categorized into four classes similar to the DOTA and ABD data set

    Complete Data Set: For more comprehensive performance analysis of the proposed and existing object detectors in aerial scenes, we generated a large data set by combining VEDAI, DLR-3K, DOTA, and ABD data sets. The complete data set is categorized into four classes similar to the DOTA and ABD data set. The summary description of all the data sets is give...

  12. [12]

    The detection results of the AVDNet depend on various parameters, such as intersection over union (IoU) thresholds and number of anchor boxes

    Implementation Details: The entire method is implemented in darknet. The detection results of the AVDNet depend on various parameters, such as intersection over union (IoU) thresholds and number of anchor boxes. The threshold is the minimum object confidence score for which the network will detect an object. The object and class confidence values are com...

  13. [13]

    The weight decay and momentum parameters are set to 0.0005 and 0.9, respectively

    Training Configuration: Training is done over a Titan Xp GPU system with stochastic gradient descent optimizer and minibatch size = 4. The weight decay and momentum parameters are set to 0.0005 and 0.9, respectively. The training loss is calculated by taking the sum of square error from the final layer of the network, as given in [20]. We train our model wi...

  14. [14]

    The DLR-3K data set is divided with a ratio of ∼[80:20]

    Model Training: We divide VEDAI, DOTA, Complete data set into train and test set with a ratio of ∼[90:10]. The DLR-3K data set is divided with a ratio of ∼[80:20]. The AVDNet is trained over each data set without using any pretrained weights. The AVDNet detector is trained for ∼30k iterations over VEDAI, DOTA, Complete data set, and ∼15k iterations over...

  15. [15]

    We compare different methods in terms of mAP, which corresponds to the average of the maximum precisions at different recall values

    Quantitative Results: The performance measures of the proposed AVDNet and other state-of-the-art approaches for vehicle detection in VEDAI, DLR-3K, DOTA, and Complete data set is given in Table II. We compare different methods in terms of mAP, which corresponds to the average of the maximum precisions at different recall values. To ensure fair comparis...

  16. [16]

    Qualitative Results: We show the qualitative results of our approach to different challenging scenarios in Fig. 6. The detection responses from the original images are cropped out for appropriate visual representation. The AVDNet is able to detect vehicles, which are partially occluded by overhead building or trees, as shown in the first row in Fig. 6. Si...

  17. [17]

    Complexity Analysis: The computation and space complexity of the proposed method and existing state-of-the-art techniques is given in Table III. We can see that the proposed method uses approximately 1/5, 2/9, 5/14 times smaller number of parameters as compared to YOLO (v2 and v3), Faster R-CNN, and RetinaNet, respectively. Similarly, the proposed AV...

  18. [18]

    Fast multiclass vehicle detection on aerial images,

    K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9, pp. 1938–1942, Sep. 2015

  19. [19]

    Detection of cars in high-resolution aerial images of complex urban environments,

    M. ElMikaty and T. Stathaki, “Detection of cars in high-resolution aerial images of complex urban environments,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 10, pp. 5913–5924, Oct. 2017

  20. [20]

    An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery,

    Y. Xu, G. Yu, X. Wu, Y. Wang, and Y. Ma, “An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 7, pp. 1845–1856, Jul. 2017

  21. [21]

    Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning

    H. Zhou, L. Wei, C. P. Lim, D. Creighton, and S. Nahavandi, “Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 7074–7085, Dec. 2018

  22. [22]

    ORSIm detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,

    X. Wu, D. Hong, J. Tian, J. Chanussot, W. Li, and R. Tao, “ORSIm detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,” IEEE Trans. Geosci. Remote Sens., to be published. doi: 10.1109/TGRS.2019.2897139

  23. [23]

    SLIC superpixels compared to state-of-the-art superpixel methods,

    R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012

  24. [24]

    VCells: Simple and efficient superpixels using edge-weighted centroidal Voronoi tessellations,

    J. Wang and X. Wang, “VCells: Simple and efficient superpixels using edge-weighted centroidal Voronoi tessellations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1241–1247, Jun. 2012

  25. [25]

    Object detection in high-resolution remote sensing images using rotation invariant parts based model,

    W. Zhang, X. Sun, K. Fu, C. Wang, and H. Wang, “Object detection in high-resolution remote sensing images using rotation invariant parts based model,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 1, pp. 74–78, Jan. 2014

  26. [26]

    A SIFT-SVM method for detecting cars in UAV images,

    T. Moranduzzo and F. Melgani, “A SIFT-SVM method for detecting cars in UAV images,” in Proc. IEEE IGARSS, Jul. 2012, pp. 6868–6871

  27. [27]

    Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature,

    Z. Chen et al., “Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 8, pp. 2296–2309, Aug. 2016

  28. [28]

    Vehicle detection in high-resolution aerial images via sparse representation and superpixels,

    Z. Chen et al., “Vehicle detection in high-resolution aerial images via sparse representation and superpixels,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 1, pp. 103–116, Jan. 2016

  29. [29]

    Rotation-invariant object detection in high-resolution satellite imagery using superpixel-based deep Hough forests,

    Y. Yu, H. Guan, and Z. Ji, “Rotation-invariant object detection in high-resolution satellite imagery using superpixel-based deep Hough forests,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 2183–2187, Nov. 2015

  30. [30]

    Deep multi-modal vehicle detection in aerial ISR imagery,

    W. Sakla, G. Konjevod, and T. N. Mundhenk, “Deep multi-modal vehicle detection in aerial ISR imagery,” in Proc. IEEE WACV, Mar. 2017, pp. 916–923

  31. [31]

    Fast deep vehicle detection in aerial images,

    L. W. Sommer, T. Schuchert, and J. Beyerer, “Fast deep vehicle detection in aerial images,” in Proc. IEEE WACV, Mar. 2017, pp. 311–319

  32. [32]

    Semantic labeling based vehicle detection in aerial imagery,

    K. Nie, L. Sommer, A. Schumann, and J. Beyerer, “Semantic labeling based vehicle detection in aerial imagery,” in Proc. IEEE WACV, Mar. 2018, pp. 626–634

  33. [33]

    Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,

    Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017

  34. [34]

    Accurate object localization in remote sensing images based on convolutional neural networks,

    Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, May 2017

  35. [35]

    Faster R-CNN: Towards real-time object detection with region proposal networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017

  36. [36]

    You only look once: Unified, real-time object detection,

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. CVPR, 2016, pp. 779–788

  37. [37]

    YOLO9000: Better, faster, stronger,

    J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR, 2017, pp. 7263–7271

  38. [38]

    YOLOv3: An Incremental Improvement

    J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767

  39. [39]

    Focal loss for dense object detection,

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. ICCV, 2017, pp. 2980–2988

  40. [40]

    Rotation-insensitive and context-augmented object detection in remote sensing images,

    K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2337–2348, Apr. 2018

  41. [41]

    Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,

    G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016

  42. [42]

    Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection,

    G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 265–278, Jan. 2019

  43. [43]

    Efficient saliency-based object detection in remote sensing images using deep belief networks,

    W. Diao, X. Sun, X. Zheng, F. Dou, H. Wang, and K. Fu, “Efficient saliency-based object detection in remote sensing images using deep belief networks,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 137–141, Feb. 2016

  44. [44]

    Vehicle detection in aerial imagery: A small target detection benchmark,

    S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016

  45. [45]

    DOTA: A large-scale dataset for object detection in aerial images,

    G. S. Xia et al., “DOTA: A large-scale dataset for object detection in aerial images,” in Proc. CVPR, 2018, pp. 3974–3983