AVDNet: A Small-Sized Vehicle Detection Network for Aerial Visual Data

Manal Shah; Murari Mandal; Prashant Meena; Sanhita Devi; Santosh Kumar Vipparthi

arxiv: 1907.07477 · v1 · pith:BJIAQAJ3new · submitted 2019-07-17 · 💻 cs.CV


Pith reviewed 2026-05-24 20:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords: small vehicle detection, aerial imagery, one-stage object detection, residual blocks, convolutional neural network, feature preservation, aerial dataset


The pith

AVDNet uses multi-scale residual blocks to detect small vehicles in aerial images with less computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AVDNet, a one-stage detector built to handle the small size, complex backgrounds, and uniform appearance of vehicles seen from above. It inserts ConvRes residual blocks at several scales and enlarges the final feature map so that detail about tiny objects does not vanish in deeper layers. The network is evaluated on VEDAI, DLR-3K, DOTA and a newly annotated ABD set, where it reports higher mean average precision together with lower computation and memory use than prior detectors. A recurrent-feature visualization method is also presented to inspect internal behavior. If the blocks and enlarged map are the main drivers, then lightweight aerial detection becomes practical on platforms with tight power or storage limits.

Core claim

AVDNet is a one-stage vehicle detection network that places ConvRes residual blocks at multiple scales to counteract the loss of features for small objects that occurs with deeper convolutional layers. An enlarged output feature map works with these blocks to maintain robust representations of salient features for small-sized vehicles. The design is shown to raise mean average precision on VEDAI, DLR-3K, DOTA and the combined set that includes the new ABD collection while cutting both computation time and model size relative to existing techniques.

What carries the argument

ConvRes residual blocks inserted at multiple scales, combined with an enlarged output feature map, preserve detail for small objects through the network.

If this is right

One-stage detectors can maintain small-object performance without adding depth that erases fine detail.
Lower computation and space complexity make the detector suitable for onboard aerial platforms.
The RFAV visualization technique provides a way to inspect how residual connections affect feature retention in aerial scenes.
A new annotated airborne dataset supplies additional examples of small vehicles for training and testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-scale residual pattern could be tested on other small-object tasks such as counting animals or infrastructure in satellite imagery.
If the enlarged feature map proves decisive, pairing it with different backbone networks might produce further efficiency gains.
Extending the static-image approach to video sequences would test whether motion cues add value beyond the spatial improvements shown.

Load-bearing premise

The reported gains in accuracy and efficiency come from the ConvRes blocks and enlarged feature map rather than from training choices or dataset quirks.

What would settle it

An ablation that removes the ConvRes blocks while keeping the training protocol and datasets fixed would settle it: if mean average precision on VEDAI or DOTA barely drops, the claim that the blocks drive the gains is falsified.

Figures

Figures reproduced from arXiv: 1907.07477 by Manal Shah, Murari Mandal, Prashant Meena, Sanhita Devi, Santosh Kumar Vipparthi.

**Figure 3.** Sample activation responses after each ConvRes block of AVDNet. The red boxes highlight the activations in different regions for the presence of vehicles in the input image; d = depth of the activation map. Body text captured with the caption: … location (a, b) is calculated using the following equation: $H^{(a,b)}_{l}(z) = \sum_{k=1}^{d} \delta\left(F^{k}_{l}(a, b) - z\right),\ z \in [0, 255]$ (5)

**Figure 4.** Precision-recall graph of the proposed and existing state-of-the-art object detectors over (a) VEDAI and (b) DLR-3K.

**Figure 5.** Precision-recall graph of the proposed and existing state-of-the-art object detectors over (a) DOTA and (b) Complete data set.

**Figure 6.** Qualitative performance of AVDNet under various challenging scenarios. First row: occlusion by overhead building/trees. Second row: vehicles of varying sizes and orientations. Third row: vehicles covered with shadows. (a) Input aerial image. (b) Vehicles detected by AVDNet.

read the original abstract

Detection of small-sized targets in aerial views is a challenging task due to the smallness of vehicle size, complex background, and monotonic object appearances. In this letter, we propose a one-stage vehicle detection network (AVDNet) to robustly detect small-sized vehicles in aerial scenes. In AVDNet, we introduced ConvRes residual blocks at multiple scales to alleviate the problem of vanishing features for smaller objects caused because of the inclusion of deeper convolutional layers. These residual blocks, along with enlarged output feature map, ensure the robust representation of the salient features for small sized objects. Furthermore, we proposed a recurrent-feature aware visualization (RFAV) technique to analyze the network behavior. We also created a new airborne image data set (ABD) by annotating 1396 new objects in 79 aerial images for our experiments. The effectiveness of AVDNet is validated on VEDAI, DLR- 3K, DOTA, and the combined (VEDAI, DLR-3K, DOTA, and ABD) data set. Experimental results demonstrate the significant performance improvement of the proposed method over state-of-the-art detection techniques in terms of mAP, computation, and space complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes AVDNet, a one-stage detector for small vehicles in aerial imagery. It introduces ConvRes residual blocks at multiple scales plus an enlarged output feature map to mitigate vanishing features, a recurrent-feature aware visualization (RFAV) technique, and a new 79-image ABD dataset. Experiments on VEDAI, DLR-3K, DOTA and the combined set claim superior mAP together with lower computation and model size versus prior detectors.

Significance. If the reported gains can be shown to arise from the ConvRes blocks and enlarged feature map rather than from uncontrolled differences in training or baselines, the work would offer a compact architecture useful for real-time aerial vehicle detection. The new ABD set and RFAV tool are modest additional assets.

major comments (3)

[Experiments] Experiments section: no ablation results are supplied that isolate the contribution of the ConvRes blocks or the enlarged output feature map (e.g., AVDNet minus ConvRes, or minus enlarged map). Without these controlled comparisons the headline claim that the architectural additions drive the mAP/complexity gains cannot be verified.
[Experiments] Experiments section: the manuscript does not state that the cited state-of-the-art baselines were re-trained under identical optimizer, augmentation, learning-rate schedule and loss weighting as AVDNet. Observed deltas could therefore reflect implementation differences rather than the proposed components.
[Dataset] Dataset section: the newly introduced ABD set contains only 79 images. When results are reported on the combined (VEDAI+DLR-3K+DOTA+ABD) collection, the small size of ABD limits the strength of any generalization claim.

minor comments (1)

[Abstract] Abstract: 'DLR- 3K' contains an extraneous space before the numeral.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.


Referee: Experiments section: no ablation results are supplied that isolate the contribution of the ConvRes blocks or the enlarged output feature map (e.g., AVDNet minus ConvRes, or minus enlarged map). Without these controlled comparisons the headline claim that the architectural additions drive the mAP/complexity gains cannot be verified.

Authors: We agree that dedicated ablation studies would more directly isolate the effects of the ConvRes residual blocks and the enlarged output feature map. The manuscript demonstrates overall gains via end-to-end comparisons against published baselines on four datasets, but these do not decompose the individual contributions. We will add controlled ablation experiments (AVDNet variants with and without each component) to the revised Experiments section. revision: yes
Referee: Experiments section: the manuscript does not state that the cited state-of-the-art baselines were re-trained under identical optimizer, augmentation, learning-rate schedule and loss weighting as AVDNet. Observed deltas could therefore reflect implementation differences rather than the proposed components.

Authors: Baseline numbers are taken from the original publications, as is conventional when full re-implementation details are unavailable. Our own training protocol for AVDNet is described in detail. We recognize that this leaves room for implementation variance and will re-train the primary baselines (e.g., YOLO, SSD variants) under identical settings for the revision to enable a fairer head-to-head comparison. revision: yes
Referee: Dataset section: the newly introduced ABD set contains only 79 images. When results are reported on the combined (VEDAI+DLR-3K+DOTA+ABD) collection, the small size of ABD limits the strength of any generalization claim.

Authors: ABD (79 images, 1396 objects) was introduced as a modest supplementary set to increase scene diversity rather than as a large-scale benchmark. Individual results on VEDAI, DLR-3K and DOTA are also reported separately. We will revise the text to explicitly frame ABD as an auxiliary validation set and temper any generalization language regarding the combined collection. revision: partial

Circularity Check

0 steps flagged

No circularity: experimental claims rest on direct dataset comparisons without internal reductions

full rationale

The manuscript proposes AVDNet with ConvRes blocks and an enlarged feature map, then reports mAP and complexity numbers on VEDAI/DLR-3K/DOTA/ABD. No equations, fitted parameters, or predictions appear; the central claim is an empirical performance delta against prior detectors. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the architecture. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim depends on the assumption that residual blocks preserve small-object features in deep CNNs and that the chosen datasets adequately represent real aerial conditions; no independent evidence for these modeling choices is supplied in the abstract.

free parameters (1)

network scale factors and block placement
The number and placement of ConvRes blocks and the enlargement factor of the output feature map are chosen by the authors; exact values not stated in abstract.

axioms (1)

domain assumption: Residual connections alleviate vanishing features for small objects in deeper layers
Invoked to justify the ConvRes design choice.

invented entities (2)

ConvRes residual blocks (no independent evidence)
purpose: Preserve salient features of small vehicles across network depth
Newly named component introduced for this task; no external validation cited.
RFAV visualization technique (no independent evidence)
purpose: Analyze network attention via recurrent features
Newly proposed method; no external validation cited.


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

[1]

Convolutional Layer: Let $I^{C}(a, b)$ be an input image of size M × M, with a ∈ [1, M] and b ∈ [1, M], having C channels, and let f(·) be the filter with a kernel size h × h. The response of the convolutional layer (conv) is computed by the following equation: $F^{d} = \left.\sum_{j=1}^{C} f^{k}(h) * I^{n}_{j} + b^{k}\,\right|_{k=1}^{d}$ (1) where $b^{k}$ is the bias, n ∈ [1, M], and d is the filter depth. ...

work page
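
Equation (1) can be illustrated with a minimal pure-Python sketch. It assumes stride 1, no padding, and the cross-correlation form that most CNN frameworks use when they say "convolution". The function name `conv_response` and the nested-list layout are illustrative choices made here, not the paper's darknet implementation.

```python
def conv_response(image, filters, biases):
    """Compute d feature maps from a C-channel image (stride 1, no padding).

    image:   C x M x M nested lists (the input I with C channels)
    filters: d x C x h x h nested lists (one h x h kernel per filter and channel)
    biases:  d floats (one bias per filter)
    """
    C, M = len(image), len(image[0])
    h = len(filters[0][0])
    out = M - h + 1  # output height/width when no padding is used
    maps = []
    for k, bank in enumerate(filters):
        # Start every output cell at the bias of filter k.
        fmap = [[biases[k]] * out for _ in range(out)]
        for j in range(C):  # sum over the C input channels
            for a in range(out):
                for b in range(out):
                    fmap[a][b] += sum(
                        bank[j][u][v] * image[j][a + u][b + v]
                        for u in range(h)
                        for v in range(h)
                    )
        maps.append(fmap)
    return maps
```

The returned list has one map per filter, so its length plays the role of the filter depth d in the equation.
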
[2]

These ConvRes residual features are studied at three different scales in the AVDNet

ConvRes Blocks: The response of a ConvRes block consisting of three conv layers is computed using the following equation: $F^{d}_{\mathrm{ConvRes}} = F^{d}_{l}(a, b) + F^{d}_{l-2}(a, b)$ (3) where l is the current conv feature layer. These ConvRes residual features are studied at three different scales in the AVDNet. The 1 × 1 conv response along with the leaky ReLU introduces an...

work page
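
The residual sum in equation (3) is an element-wise addition of the current layer's feature maps and the maps from two layers earlier. The sketch below shows only that addition; the helper name `convres_response` and the d x H x W nested-list layout are assumptions for illustration, and the conv layers that produce the two inputs are not modeled.

```python
def convres_response(f_l, f_l_minus_2):
    """Element-wise sum of layer l and layer l-2 feature maps.

    Both inputs are d x H x W nested lists and must share the same shape,
    since a skip connection can only add tensors of matching dimensions.
    """
    if len(f_l) != len(f_l_minus_2):
        raise ValueError("feature-map depth mismatch")
    summed = []
    for map_l, map_prev in zip(f_l, f_l_minus_2):
        if len(map_l) != len(map_prev) or len(map_l[0]) != len(map_prev[0]):
            raise ValueError("feature-map size mismatch")
        summed.append(
            [[x + y for x, y in zip(row_l, row_prev)]
             for row_l, row_prev in zip(map_l, map_prev)]
        )
    return summed
```

Because the earlier layer's activations are added back unchanged, fine detail from layer l-2 survives even when layer l has suppressed it, which is the feature-preservation argument the excerpt makes.
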
[3]

In Fig. 2, the composite visual representation of the multiple feature maps generated at the end of a conv operation is shown

Recurrent-Feature Aware Visualization: In Fig. 2, the composite visual representation of the multiple feature maps generated at the end of a conv operation is shown. For d feature maps in conv layer l, the RFAV representation is computed using the following equation: $\mathrm{RFAV}_{l}(a, b) = \arg\max_{z}\left(H^{(a,b)}_{l}(z)\right),\ z \in [0, 255]$ (4) where arg max (·) collects ...

work page
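
Equation (4) can be read as taking, at each pixel, the most frequent activation value across the d feature maps. The sketch below follows that reading under three assumptions that the truncated excerpt does not confirm: activations are already quantized to integers in [0, 255], the histogram H counts how many maps take value z at location (a, b), and ties are broken by choosing the smallest z. The function name `rfav` is illustrative.

```python
from collections import Counter


def rfav(feature_maps):
    """Per-pixel mode across d feature maps.

    feature_maps: d x H x W nested lists of integers in [0, 255].
    Returns an H x W nested list holding the most frequent value at each pixel.
    """
    d = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    result = []
    for a in range(height):
        row = []
        for b in range(width):
            # Histogram of the d activations observed at location (a, b).
            hist = Counter(feature_maps[k][a][b] for k in range(d))
            top = max(hist.values())
            # Assumed tie-break: smallest value among the most frequent ones.
            row.append(min(z for z, count in hist.items() if count == top))
        result.append(row)
    return result
```
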
[4]

These detailed features are very useful in the detection of small and dense objects

Feature Degradation Problem: Usually, the initial layers have detailed information as compared to the features at the deeper layers. These detailed features are very useful in the detection of small and dense objects. In order to preserve the small-sized object features, we designed residual feature blocks at multiple scales in the AVDNet. These residual...

work page
[5]

This is in contrast to the features of bigger objects with higher pixel-per-object values, which are clearly detected in the deeper CNN layers

Effect of Final Feature Map Resolution: The lower pixels-per-object values of the smaller objects cause the features to vanish in the deeper networks. This is in contrast to the features of bigger objects with higher pixel-per-object values, which are clearly detected in the deeper CNN layers. For example, if the resolution of an image is 1024 × 1024, ...

work page
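
The paper's own worked example is cut off in this excerpt, so the arithmetic below is a stand-in rather than a reproduction. It shows how the overall downsampling stride determines how many output cells a small object covers. The strides (32 and 16) and the 20-pixel vehicle width are hypothetical values chosen for illustration.

```python
def feature_map_size(input_size, stride):
    """Side length of the final feature map for a square input."""
    return input_size // stride


def cells_per_object(object_pixels, stride):
    """How many feature-map cells an object of the given width spans."""
    return object_pixels / stride


# Hypothetical comparison for a 1024 x 1024 input and a 20-pixel-wide vehicle.
coarse = feature_map_size(1024, 32), cells_per_object(20, 32)
fine = feature_map_size(1024, 16), cells_per_object(20, 16)
```

With the coarser map the vehicle covers less than one cell, while halving the stride lets it span more than one, which is the intuition behind enlarging the output feature map.
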
[6]

This point was also reiterated by Lin et al

Higher Pixels-Per-Object Values: We have made another observation that the input layer size influences the network’s capability to learn the features for the small-size objects. This point was also reiterated by Lin et al. [22] in RetinaNet where they used 600-pixel and 800-pixel image scale as input to the network to improve the detection performance. In...

work page
[7]

In our experiments, we have trained our proposed AVDNet for 11 vehicle categories

VEDAI: VEDAI [27] data set contains aerial images captured from various scenarios for vehicle detection. In our experiments, we have trained our proposed AVDNet for 11 vehicle categories. The details of all the data sets (number of images, objects per class, etc.) are given in Table I (Summarization of the Evaluated Data Sets)

work page
[8]

For our experiments, we have divided each image (total of 20 images) into 16 parts to generate 320 images

DLR-3K: DLR-3K [1] is mainly comprised of scenes from urban and residential areas. For our experiments, we have divided each image (total of 20 images) into 16 parts to generate 320 images. We have manually annotated all the images in DLR-3K and generated 8401 horizontally aligned bounding boxes for all the objects. Finally, we selected 262 images with ...

work page
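
The split of each DLR-3K image into 16 parts can be sketched as a grid of crop boxes. The excerpt does not say how the 16 parts are laid out, so the 4 x 4 grid below is an assumption, as are the function name `tile_boxes` and the example image size.

```python
def tile_boxes(width, height, rows=4, cols=4):
    """Return (left, top, right, bottom) crop boxes covering the image.

    The boxes form a rows x cols grid with no overlap. Integer division
    keeps the boxes contiguous even when the size is not evenly divisible.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


# 20 source images x 16 tiles each gives the 320 images stated in the text.
tiles_total = 20 * len(tile_boxes(1000, 800))
```

Ground-truth boxes would also need to be clipped or reassigned per tile; that step is not described in the excerpt and is left out here.
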
[9]

In our experiments, we have represented these objects through four categories: car, heavy vehicle, plane, and boat

DOTA: DOTA [28] introduced a large-scale data set consisting of 2806 aerial images. In our experiments, we have represented these objects through four categories: car, heavy vehicle, plane, and boat. Moreover, we manually annotated all the images in DOTA and generated 55 235 horizontally aligned bounding boxes as ground truth

work page
[10]

The objects were annotated with four different classes: car, heavy vehicle, plane, and boat

Airborne Data Set (ABD) Data Set: We collected 79 new aerial images from online sources and generated a new data set named ABD by annotating 1396 objects for our experiments. The objects were annotated with four different classes: car, heavy vehicle, plane, and boat

work page
[11]

The complete data set is categorized into four classes similar to the DOTA and ABD data set

Complete Data Set: For more comprehensive performance analysis of the proposed and existing object detectors in aerial scenes, we generated a large data set by combining VEDAI, DLR-3K, DOTA, and ABD data sets. The complete data set is categorized into four classes similar to the DOTA and ABD data sets. The summary description of all the data sets is give...

work page
[12]

The detection results of the AVDNet depend on various parameters, such as intersection over union (IoU) thresholds and number of anchor boxes

Implementation Details: The entire method is implemented in darknet. The detection results of the AVDNet depend on various parameters, such as intersection over union (IoU) thresholds and number of anchor boxes. The threshold is the minimum object confidence score for which the network will detect an object. The object and class confidence values are com...

work page
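
The two quantities named in this excerpt, IoU and the confidence threshold, can be sketched for horizontally aligned boxes as follows. The box format (x_min, y_min, x_max, y_max), the detection tuple layout, and both function names are assumptions for illustration; the actual threshold values used in the paper are not visible in this excerpt.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_confident(detections, threshold):
    """Keep detections whose confidence meets the threshold.

    detections: list of (confidence, box) tuples.
    """
    return [det for det in detections if det[0] >= threshold]
```
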
[13]

The weight decay and momentum parameters are set to 0.0005 and 0.9, respectively

Training Configuration: Training is done over a Titan Xp GPU system with stochastic gradient descent optimizer and minibatch size = 4. The weight decay and momentum parameters are set to 0.0005 and 0.9, respectively. The training loss is calculated by taking the sum of square error from the final layer of the network, as given in [20]. We train our model wi...

work page
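
The stated momentum (0.9) and weight decay (0.0005) can be placed in context with a sketch of one SGD update and a sum-of-squared-error loss. This uses one common textbook formulation of momentum with L2 weight decay; darknet's exact update rule may differ, and the learning rate is not visible in this excerpt, so any value passed as `lr` is hypothetical.

```python
def sum_squared_error(predictions, targets):
    """Sum of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


def sgd_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay.

    Returns the updated weights and velocity as new lists.
    """
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

Weight decay adds a small pull toward zero on every step, while momentum carries part of the previous update forward, which smooths the noisy gradients that come with a minibatch of only 4 images.
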
[14]

The DLR-3K data set is divided with a ratio of ∼[80:20]

Model Training: We divide VEDAI, DOTA, Complete data set into train and test set with a ratio of ∼[90:10]. The DLR-3K data set is divided with a ratio of ∼[80:20]. The AVDNet is trained over each data set without using any pretrained weights. The AVDNet detector is trained for ∼30k iterations over VEDAI, DOTA, Complete data set, and ∼15k iterations over...

work page
[15]

We compare different methods in terms of mAP, which corresponds to the average of the maximum precisions at different recall values

Quantitative Results: The performance measures of the proposed AVDNet and other state-of-the-art approaches for vehicle detection in VEDAI, DLR-3K, DOTA, and Complete data set are given in Table II. We compare different methods in terms of mAP, which corresponds to the average of the maximum precisions at different recall values. To ensure fair comparis...

work page
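
The description of mAP as "the average of the maximum precisions at different recall values" matches interpolated average precision. The sketch below uses the 11-point variant (recall levels 0.0, 0.1, ..., 1.0), which is an assumption here since the excerpt does not state how many recall levels were used. Function names are illustrative.

```python
def average_precision(recalls, precisions, points=11):
    """Interpolated AP for one class.

    At each evenly spaced recall level, take the maximum precision among
    all operating points whose recall is at least that level, then average.
    """
    total = 0.0
    for i in range(points):
        level = i / (points - 1)
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / points


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```
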
[16]

Qualitative Results: We show the qualitative results of our approach to different challenging scenarios in Fig. 6. The detection responses from the original images are cropped out for appropriate visual representation. The AVDNet is able to detect vehicles, which are partially occluded by overhead building or trees, as shown in the first row in Fig. 6. Si...

work page
[17]

Complexity Analysis: The computation and space complexity of the proposed method and existing state-of-the-art techniques are given in Table III. We can see that the proposed method uses approximately 1/5, 2/9, and 5/14 as many parameters as YOLO (v2 and v3), Faster R-CNN, and RetinaNet, respectively. Similarly, the proposed AV...

work page
[18]

Fast multiclass vehicle detection on aerial images,

K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9, pp. 1938–1942, Sep. 2015

work page 2015
[19]

Detection of cars in high-resolution aerial images of complex urban environments,

M. ElMikaty and T. Stathaki, “Detection of cars in high-resolution aerial images of complex urban environments,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 10, pp. 5913–5924, Oct. 2017

work page 2017
[20]

An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery,

Y. Xu, G. Yu, X. Wu, Y. Wang, and Y. Ma, “An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 7, pp. 1845–1856, Jul. 2017

work page 2017
[21]

Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning

H. Zhou, L. Wei, C. P. Lim, D. Creighton, and S. Nahavandi, “Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 7074–7085, Dec. 2018

work page 2018
[22]

ORSIm detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,

X. Wu, D. Hong, J. Tian, J. Chanussot, W. Li, and R. Tao, “ORSIm detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,” IEEE Trans. Geosci. Remote Sens., to be published. doi: 10.1109/TGRS.2019.2897139

work page doi:10.1109/tgrs.2019.2897139 2019
[23]

SLIC superpixels compared to state-of-the-art superpixel methods,

R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012

work page 2012
[24]

VCells: Simple and efficient superpixels using edge-weighted centroidal Voronoi tessellations,

J. Wang and X. Wang, “VCells: Simple and efficient superpixels using edge-weighted centroidal Voronoi tessellations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1241–1247, Jun. 2012

work page 2012
[25]

Object detection in high-resolution remote sensing images using rotation invariant parts based model,

W. Zhang, X. Sun, K. Fu, C. Wang, and H. Wang, “Object detection in high-resolution remote sensing images using rotation invariant parts based model,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 1, pp. 74–78, Jan. 2014

work page 2014
[26]

A SIFT-SVM method for detecting cars in UAV images,

T. Moranduzzo and F. Melgani, “A SIFT-SVM method for detecting cars in UAV images,” in Proc. IEEE IGARSS, Jul. 2012, pp. 6868–6871

work page 2012
[27]

Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature,

Z. Chen et al., “Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 8, pp. 2296–2309, Aug. 2016

work page 2016
[28]

Vehicle detection in high-resolution aerial images via sparse representation and superpixels,

Z. Chen et al., “Vehicle detection in high-resolution aerial images via sparse representation and superpixels,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 1, pp. 103–116, Jan. 2016

work page 2016
[29]

Rotation-invariant object detection in high-resolution satellite imagery using superpixel-based deep Hough forests,

Y. Yu, H. Guan, and Z. Ji, “Rotation-invariant object detection in high-resolution satellite imagery using superpixel-based deep Hough forests,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 2183–2187, Nov. 2015

work page 2015
[30]

Deep multi-modal vehicle detection in aerial ISR imagery,

W. Sakla, G. Konjevod, and T. N. Mundhenk, “Deep multi-modal vehicle detection in aerial ISR imagery,” in Proc. IEEE WACV, Mar. 2017, pp. 916–923

work page 2017
[31]

Fast deep vehicle detection in aerial images,

L. W. Sommer, T. Schuchert, and J. Beyerer, “Fast deep vehicle detection in aerial images,” in Proc. IEEE WACV, Mar. 2017, pp. 311–319

work page 2017
[32]

Semantic labeling based vehicle detection in aerial imagery,

K. Nie, L. Sommer, A. Schumann, and J. Beyerer, “Semantic labeling based vehicle detection in aerial imagery,” in Proc. IEEE WACV, Mar. 2018, pp. 626–634

work page 2018
[33]

Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,

Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017

work page 2017
[34]

Accurate object localization in remote sensing images based on convolutional neural networks,

Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, May 2017

work page 2017
[35]

Faster R-CNN: Towards real-time object detection with region proposal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017

work page 2017
[36]

You only look once: Unified, real-time object detection,

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. CVPR, 2016, pp. 779–788

work page 2016
[37]

YOLO9000: Better, faster, stronger,

J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR, 2017, pp. 7263–7271

work page 2017
[38]

YOLOv3: An Incremental Improvement

J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767

work page internal anchor Pith review Pith/arXiv arXiv 2018
[39]

Focal loss for dense object detection,

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. ICCV, 2017, pp. 2980–2988

work page 2017
[40]

Rotation-insensitive and context-augmented object detection in remote sensing images,

K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2337–2348, Apr. 2018

work page 2018
[41]

Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,

G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016

work page 2016
[42]

Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection,

G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 265–278, Jan. 2019

work page 2019
[43]

Efficient saliency-based object detection in remote sensing images using deep belief networks,

W. Diao, X. Sun, X. Zheng, F. Dou, H. Wang, and K. Fu, “Efficient saliency-based object detection in remote sensing images using deep belief networks,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 137–141, Feb. 2016

work page 2016
[44]

Vehicle detection in aerial imagery: A small target detection benchmark,

S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016

work page 2016
[45]

DOTA: A large-scale dataset for object detection in aerial images,

G. S. Xia et al., “DOTA: A large-scale dataset for object detection in aerial images,” in Proc. CVPR, 2018, pp. 3974–3983

work page 2018

[1] [1]

Convolutional Layer: Let I C(a, b) be an input image of size M × M; a ∈[ 1, M], b ∈[ 1, M] having C channels and f (·) is the ﬁlter with a kernel size h × h. The response of the convolutional layer ( conv) is computed by the following equation: Fd = C∑ j=1 f k(h) ∗ I n j + bk ⏐⏐⏐⏐ ⏐ ⏐ d k=1 (1) where bk is the bias, n ∈[ 1, M],a n d d is the ﬁlter depth. ...

work page

[2] [2]

These ConvRes residual features are studied at three different scales in the A VDNet

ConvRes Blocks: The response of a ConvRes block consisting of three conv layers is computed using the following equation: Fd ConvRes = Fd l (a, b) + Fd l−2(a, b) (3) where l is the current conv feature layer. These ConvRes residual features are studied at three different scales in the A VDNet. The 1× 1 conv response along with the leaky ReLu introduces an...

work page

[3] [3]

2, the composite visual representation of the multiple feature maps generated at the end of a conv operation is shown

Recurrent-Feature Aware Visualization: In Fig. 2, the composite visual representation of the multiple feature maps generated at the end of a conv operation is shown. For d feature maps in conv layer l, the RFA V representation is computed using the following equation: RFA V l(a, b) = arg maxz ( H(a,b) l (z) ) ; z ∈[ 0, 255] (4) where arg max (·) collects ...

work page

[4] [4]

These detailed features are very useful in the detection of small and dense objects

Feature Degradation Problem: Usually, the initial layers have detailed information as compared to the features at the deeper layers. These detailed features are very useful in the detection of small and dense objects. In order to preserve the small-sized object features, we designed residual feature blocks at multiple scales in the A VDNet. These residual...

work page

[5] [5]

This is in contrast to the features of bigger objects with higher pixel-per-object values, which are clearly det ected in the deeper CNN layers

Effect of Final Feature Map Resolution: The lower pixels-per-object values of the smaller objects cause the fea- tures to vanish in the deeper networks. This is in contrast to the features of bigger objects with higher pixel-per-object values, which are clearly det ected in the deeper CNN layers. For example, if the resolution of an image is 1024 × 1024, ...

work page

[6] [6]

This point was also reiterated by Lin et al

Higher Pixels-Per-Object V alues: We have made another observation that the input layer size inﬂuences the network’s capability to learn the features for the small-size objects. This point was also reiterated by Lin et al. [22] in RetinaNet where they used 600-pixel and 800-pixel image scale as input to the network to improve the detection performance. In...

work page

[7] [7]

In our TABLE I SUMMARIZATION OF THE EV ALUATED DATA SETS experiments, we have trained our proposed A VDNet for 11 vehicle categories

VEDAI: VEDAI [27] data set contains aerial images captured from various scenario s for vehicle detection. In our TABLE I SUMMARIZATION OF THE EV ALUATED DATA SETS experiments, we have trained our proposed A VDNet for 11 vehicle categories. The details of all the data sets (number of images, objects per class, etc.) are given in Table I

[8]

DLR-3K: DLR-3K [1] mainly comprises scenes from urban and residential areas. For our experiments, we divided each of the 20 images into 16 parts to generate 320 images. We manually annotated all the images in DLR-3K and generated 8401 horizontally aligned bounding boxes for all the objects. Finally, we selected 262 images with ...
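The tiling step (20 images × 16 parts = 320 images) can be sketched as follows. The excerpt does not say how the 16 parts are laid out, so the 4 × 4 grid, the function name, and the image size in the usage note are assumptions for illustration only.

```python
def tile_boxes(width, height, rows=4, cols=4):
    """Return (left, top, right, bottom) crop boxes covering an image.

    Assumption: the 16 parts form a regular 4 x 4 grid; the paper excerpt
    only states that each image is divided into 16 parts.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            # Integer arithmetic keeps the tiles contiguous and gap-free.
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes
```

For a placeholder 4000 × 3000 image, this yields 16 tiles of 1000 × 750 pixels each; applying it to 20 source images gives the 320 tiles mentioned above.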

[9]

DOTA: DOTA [28] introduced a large-scale data set consisting of 2806 aerial images. In our experiments, we represented these objects through four categories: car, heavy vehicle, plane, and boat. Moreover, we manually annotated all the images in DOTA and generated 55 235 horizontally aligned bounding boxes as ground truth.

[10]

Airborne Data Set (ABD): We collected 79 new aerial images from online sources and generated a new data set named ABD by annotating 1396 objects for our experiments. The objects were annotated with four different classes: car, heavy vehicle, plane, and boat.

[11]

Complete Data Set: For a more comprehensive performance analysis of the proposed and existing object detectors in aerial scenes, we generated a large data set by combining the VEDAI, DLR-3K, DOTA, and ABD data sets. The complete data set is categorized into four classes, similar to the DOTA and ABD data sets. The summary description of all the data sets is give...

[12]

Implementation Details: The entire method is implemented in darknet. The detection results of AVDNet depend on various parameters, such as the intersection over union (IoU) thresholds and the number of anchor boxes. The threshold is the minimum object confidence score for which the network will detect an object. The object and class confidence values are com...
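Since the IoU threshold is one of the parameters that the detection results depend on, a minimal sketch of the standard IoU computation for two axis-aligned boxes may help. The `(x1, y1, x2, y2)` box layout and the function name are choices made for this illustration, not details of the darknet implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

A detection is typically counted as correct when its IoU with a ground-truth box meets the chosen threshold.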

[13]

Training Configuration: Training is done on a Titan Xp GPU system using the stochastic gradient descent optimizer with a minibatch size of 4. The weight decay and momentum parameters are set to 0.0005 and 0.9, respectively. The training loss is calculated as the sum of squared errors from the final layer of the network, as given in [20]. We train our model wi...
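To show how the momentum (0.9) and weight decay (0.0005) values enter an update, here is a textbook form of SGD with momentum and L2 weight decay. This is a generic sketch and may differ in detail from the darknet update rule; the function name is hypothetical, and the learning rate must be supplied by the caller because the excerpt above does not state it.

```python
def sgd_momentum_step(weights, grads, velocity, lr,
                      momentum=0.9, weight_decay=0.0005):
    """One generic SGD update with momentum and L2 weight decay.

    weights, grads, velocity: equal-length lists of floats.
    lr: learning rate (placeholder; not given in the excerpt).
    Returns the updated (weights, velocity) as new lists.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Weight decay adds a pull toward zero to the raw gradient.
        v_next = momentum * v - lr * (g + weight_decay * w)
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity
```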

[14]

Model Training: We divide the VEDAI, DOTA, and Complete data sets into training and test sets with a ratio of ∼90:10. The DLR-3K data set is divided with a ratio of ∼80:20. AVDNet is trained on each data set without using any pretrained weights. The AVDNet detector is trained for ∼30k iterations over the VEDAI, DOTA, and Complete data sets, and ∼15k iterations over...
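A split of this kind can be sketched as below. The excerpt does not describe how images were assigned to each subset, so the random shuffle, the fixed seed, and the function name are assumptions for illustration.

```python
import random


def split_dataset(items, train_fraction, seed=0):
    """Shuffle items and split them into (train, test) lists.

    Assumption: a seeded random shuffle; the paper excerpt only gives
    the approximate ratios (about 90:10 or 80:20).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

For example, `split_dataset(range(100), 0.9)` returns 90 training items and 10 test items with no overlap between them.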

[15]

Quantitative Results: The performance measures of the proposed AVDNet and other state-of-the-art approaches for vehicle detection on the VEDAI, DLR-3K, DOTA, and Complete data sets are given in Table II. We compare the different methods in terms of mAP, which corresponds to the average of the maximum precisions at different recall values. To ensure fair comparis...
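The description of mAP above matches the interpolated average precision used in PASCAL VOC-style evaluation. The sketch below assumes the 11-point variant (recall levels 0, 0.1, ..., 1.0), which the excerpt does not state explicitly; the function names are hypothetical.

```python
def average_precision_11pt(recalls, precisions):
    """11-point interpolated average precision for one class.

    recalls, precisions: equal-length lists describing the points of a
    precision-recall curve. At each recall level t, the maximum precision
    among points with recall >= t is used (0 if there is none).
    """
    total = 0.0
    for i in range(11):
        t = i / 10
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        total += max(candidates) if candidates else 0.0
    return total / 11


def mean_average_precision(per_class_ap):
    """mAP is the mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)
```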

[16]

Qualitative Results: We show the qualitative results of our approach in different challenging scenarios in Fig. 6. The detection responses are cropped out of the original images for appropriate visual representation. AVDNet is able to detect vehicles that are partially occluded by overhead buildings or trees, as shown in the first row of Fig. 6. Si...

[17]

Complexity Analysis: The computation and space complexity of the proposed method and existing state-of-the-art techniques are given in Table III. The proposed method uses approximately 1/5, 2/9, and 5/14 as many parameters as YOLO (v2 and v3), Faster R-CNN, and RetinaNet, respectively. Similarly, the proposed AV...

[18] K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9, pp. 1938–1942, Sep. 2015.

[19] M. ElMikaty and T. Stathaki, “Detection of cars in high-resolution aerial images of complex urban environments,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 10, pp. 5913–5924, Oct. 2017.

[20] Y. Xu, G. Yu, X. Wu, Y. Wang, and Y. Ma, “An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 7, pp. 1845–1856, Jul. 2017.

[21] H. Zhou, L. Wei, C. P. Lim, D. Creighton, and S. Nahavandi, “Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 7074–7085, Dec. 2018.

[22] X. Wu, D. Hong, J. Tian, J. Chanussot, W. Li, and R. Tao, “ORSIm detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,” IEEE Trans. Geosci. Remote Sens., to be published. doi: 10.1109/TGRS.2019.2897139.

[23] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.

[24] J. Wang and X. Wang, “VCells: Simple and efficient superpixels using edge-weighted centroidal Voronoi tessellations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1241–1247, Jun. 2012.

[25] W. Zhang, X. Sun, K. Fu, C. Wang, and H. Wang, “Object detection in high-resolution remote sensing images using rotation invariant parts based model,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 1, pp. 74–78, Jan. 2014.

[26] T. Moranduzzo and F. Melgani, “A SIFT-SVM method for detecting cars in UAV images,” in Proc. IEEE IGARSS, Jul. 2012, pp. 6868–6871.

[27] Z. Chen et al., “Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 8, pp. 2296–2309, Aug. 2016.

[28] Z. Chen et al., “Vehicle detection in high-resolution aerial images via sparse representation and superpixels,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 1, pp. 103–116, Jan. 2016.

[29] Y. Yu, H. Guan, and Z. Ji, “Rotation-invariant object detection in high-resolution satellite imagery using superpixel-based deep Hough forests,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 2183–2187, Nov. 2015.

[30] W. Sakla, G. Konjevod, and T. N. Mundhenk, “Deep multi-modal vehicle detection in aerial ISR imagery,” in Proc. IEEE WACV, Mar. 2017, pp. 916–923.

[31] L. W. Sommer, T. Schuchert, and J. Beyerer, “Fast deep vehicle detection in aerial images,” in Proc. IEEE WACV, Mar. 2017, pp. 311–319.

[32] K. Nie, L. Sommer, A. Schumann, and J. Beyerer, “Semantic labeling based vehicle detection in aerial imagery,” in Proc. IEEE WACV, Mar. 2018, pp. 626–634.

[33] Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017.

[34] Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, May 2017.

[35] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.

[36] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. CVPR, 2016, pp. 779–788.

[37] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR, 2017, pp. 7263–7271.

[38] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. ICCV, 2017, pp. 2980–2988.

[40] K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2337–2348, Apr. 2018.

[41] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.

[42] G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 265–278, Jan. 2019.

[43] W. Diao, X. Sun, X. Zheng, F. Dou, H. Wang, and K. Fu, “Efficient saliency-based object detection in remote sensing images using deep belief networks,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 137–141, Feb. 2016.

[44] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016.

[45] G. S. Xia et al., “DOTA: A large-scale dataset for object detection in aerial images,” in Proc. CVPR, 2018, pp. 3974–3983.
