AVDNet: A Small-Sized Vehicle Detection Network for Aerial Visual Data
Pith reviewed 2026-05-24 20:33 UTC · model grok-4.3
The pith
AVDNet uses multi-scale residual blocks to detect small vehicles in aerial images with less computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AVDNet is a one-stage vehicle detection network that places ConvRes residual blocks at multiple scales to counteract the loss of features for small objects that occurs with deeper convolutional layers. An enlarged output feature map works with these blocks to maintain robust representations of salient features for small-sized vehicles. The design is shown to raise mean average precision on VEDAI, DLR-3K, DOTA and the combined set that includes the new ABD collection while cutting both computation time and model size relative to existing techniques.
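The role of the enlarged output feature map can be illustrated with simple stride arithmetic. The sketch below is not taken from the paper: the input size, strides, and object size are hypothetical values chosen only to show how a coarser output grid leaves a small vehicle covering a fraction of one cell, while a finer grid lets it span more than one.

```python
def grid_cells_per_object(input_size, stride, object_px):
    """Return (grid_side, cells_spanned) for a square input and a square object.

    input_size: side of the square network input in pixels (assumed value)
    stride:     total downsampling factor of the output feature map (assumed value)
    object_px:  side of the object in input pixels (assumed value)
    """
    grid_side = input_size // stride      # output feature map is grid_side x grid_side
    cells_spanned = object_px / stride    # object side length measured in output cells
    return grid_side, cells_spanned


# Hypothetical numbers for illustration only; none of them come from the paper.
for stride in (32, 16, 8):
    side, cells = grid_cells_per_object(input_size=416, stride=stride, object_px=12)
    print(f"stride {stride}: {side}x{side} grid, object spans {cells} cells per side")
```

Under these assumed numbers, reducing the stride from 32 to 8 takes the object from under half a cell to one and a half cells per side, which is the intuition behind keeping the final feature map large.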
What carries the argument
ConvRes residual blocks inserted at multiple scales, combined with an enlarged output feature map, preserve detail for small objects through the network.
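A minimal sketch of the residual idea, assuming the form quoted in the reference snippets (a block of three conv layers whose response is the output of layer l plus the output of layer l-2). For brevity every layer here is a 1×1 convolution followed by a leaky ReLU; the real kernel sizes, channel counts, and the leaky-ReLU slope are not given on this page, so `conv1x1`, `conv_res_block`, and the slope of 0.1 are illustrative assumptions rather than the authors' implementation.

```python
def leaky_relu(x, slope=0.1):
    # The slope value is an assumption; the page does not state it.
    return x if x > 0 else slope * x


def conv1x1(feature, weights, bias):
    """Apply a 1x1 convolution followed by leaky ReLU.

    feature: list of pixels, each pixel a list of C_in channel values
    weights: C_out rows, each with C_in coefficients
    bias:    C_out bias values
    """
    return [
        [leaky_relu(sum(w * v for w, v in zip(row, pixel)) + b)
         for row, b in zip(weights, bias)]
        for pixel in feature
    ]


def conv_res_block(feature, layers):
    """Run three conv layers and add the response of layer l-2 to that of layer l."""
    assert len(layers) == 3, "this sketch assumes exactly three conv layers"
    outputs = []
    x = feature
    for weights, bias in layers:
        x = conv1x1(x, weights, bias)
        outputs.append(x)
    f_l_minus_2, f_l = outputs[0], outputs[2]
    # Element-wise skip connection: F_ConvRes = F_l + F_{l-2}
    return [[a + b for a, b in zip(p_l, p_skip)]
            for p_l, p_skip in zip(f_l, f_l_minus_2)]


# Toy input: one pixel with two channels, identity weights, zero bias.
identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(conv_res_block([[1.0, -2.0]], [identity_layer] * 3))
```

The skip connection requires the two summed responses to have the same shape, which is why all three toy layers keep the channel count fixed.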
If this is right
- One-stage detectors can maintain small-object performance without adding depth that erases fine detail.
- Lower computation and space complexity make the detector suitable for onboard aerial platforms.
- The RFAV visualization technique provides a way to inspect how residual connections affect feature retention in aerial scenes.
- A new annotated airborne dataset supplies additional examples of small vehicles for training and testing.
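The RFAV technique mentioned in the list above is defined in the paper by an arg max over z in [0, 255] of a per-location quantity H. The definition is truncated on this page, so the sketch below rests on an assumption: H is read as the histogram of 8-bit values that the d feature maps of one layer take at a given pixel, which makes the RFAV value the most frequent intensity at that pixel. The function name `rfav` and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def rfav(feature_maps):
    """Collapse d feature maps into one map of per-pixel most frequent values.

    feature_maps: list of d maps, each a height x width list of ints in [0, 255].
    Assumption: H at pixel (a, b) is the histogram of values across the d maps.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    composite = []
    for a in range(height):
        row = []
        for b in range(width):
            hist = Counter(fm[a][b] for fm in feature_maps)
            # arg max over z of the histogram; ties go to the smaller intensity
            # (the tie-breaking rule is an assumption of this sketch).
            row.append(min(hist, key=lambda z: (-hist[z], z)))
        composite.append(row)
    return composite


# Three toy 1x2 feature maps.
maps = [[[0, 10]], [[0, 20]], [[5, 20]]]
print(rfav(maps))
```

If the paper's H is something other than a per-pixel histogram, only the line that builds `hist` would need to change.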
Where Pith is reading between the lines
- The same multi-scale residual pattern could be tested on other small-object tasks such as counting animals or infrastructure in satellite imagery.
- If the enlarged feature map proves decisive, pairing it with different backbone networks might produce further efficiency gains.
- Extending the static-image approach to video sequences would test whether motion cues add value beyond the spatial improvements shown.
Load-bearing premise
The reported gains in accuracy and efficiency come from the ConvRes blocks and enlarged feature map rather than from training choices or dataset quirks.
What would settle it
An ablation that removes the ConvRes blocks while keeping the same training protocol and datasets would settle it: if mean average precision on VEDAI or DOTA drops only negligibly, the claim is falsified.
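Because this test hinges on mAP, it helps to pin down the metric. The paper describes mAP as the average of the maximum precisions at different recall values; the sketch below assumes one common reading of that description, the 11-point interpolated average precision used in PASCAL VOC style evaluation. The function names and the choice of 11 points are assumptions, not details confirmed by the paper.

```python
def interpolated_ap(recalls, precisions, points=11):
    """Average, over evenly spaced recall thresholds, of the maximum precision
    achieved at any recall greater than or equal to the threshold."""
    total = 0.0
    for i in range(points):
        threshold = i / (points - 1)
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / points


def mean_ap(per_class_ap):
    """Mean of per-class average precision values (dict: class name -> AP)."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# Toy precision-recall points for one class, for illustration only.
ap = interpolated_ap(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
print(round(ap, 4))
```

For the ablation, `mean_ap` would be computed once for the full model and once for the variant without ConvRes blocks, on the same test split, and the two values compared.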
Original abstract
Detection of small-sized targets in aerial views is a challenging task due to the smallness of vehicle size, complex background, and monotonic object appearances. In this letter, we propose a one-stage vehicle detection network (AVDNet) to robustly detect small-sized vehicles in aerial scenes. In AVDNet, we introduced ConvRes residual blocks at multiple scales to alleviate the problem of vanishing features for smaller objects caused because of the inclusion of deeper convolutional layers. These residual blocks, along with enlarged output feature map, ensure the robust representation of the salient features for small sized objects. Furthermore, we proposed a recurrent-feature aware visualization (RFAV) technique to analyze the network behavior. We also created a new airborne image data set (ABD) by annotating 1396 new objects in 79 aerial images for our experiments. The effectiveness of AVDNet is validated on VEDAI, DLR- 3K, DOTA, and the combined (VEDAI, DLR-3K, DOTA, and ABD) data set. Experimental results demonstrate the significant performance improvement of the proposed method over state-of-the-art detection techniques in terms of mAP, computation, and space complexity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AVDNet, a one-stage detector for small vehicles in aerial imagery. It introduces ConvRes residual blocks at multiple scales plus an enlarged output feature map to mitigate vanishing features, a recurrent-feature aware visualization (RFAV) technique, and a new 79-image ABD dataset. Experiments on VEDAI, DLR-3K, DOTA and the combined set claim superior mAP together with lower computation and model size versus prior detectors.
Significance. If the reported gains can be shown to arise from the ConvRes blocks and enlarged feature map rather than from uncontrolled differences in training or baselines, the work would offer a compact architecture useful for real-time aerial vehicle detection. The new ABD set and RFAV tool are modest additional assets.
major comments (3)
- [Experiments] Experiments section: no ablation results are supplied that isolate the contribution of the ConvRes blocks or the enlarged output feature map (e.g., AVDNet minus ConvRes, or minus enlarged map). Without these controlled comparisons the headline claim that the architectural additions drive the mAP/complexity gains cannot be verified.
- [Experiments] Experiments section: the manuscript does not state that the cited state-of-the-art baselines were re-trained under identical optimizer, augmentation, learning-rate schedule and loss weighting as AVDNet. Observed deltas could therefore reflect implementation differences rather than the proposed components.
- [Dataset] Dataset section: the newly introduced ABD set contains only 79 images. When results are reported on the combined (VEDAI+DLR-3K+DOTA+ABD) collection, the small size of ABD limits the strength of any generalization claim.
minor comments (1)
- [Abstract] Abstract: 'DLR- 3K' contains an extraneous space before the numeral.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.
-
Referee: Experiments section: no ablation results are supplied that isolate the contribution of the ConvRes blocks or the enlarged output feature map (e.g., AVDNet minus ConvRes, or minus enlarged map). Without these controlled comparisons the headline claim that the architectural additions drive the mAP/complexity gains cannot be verified.
Authors: We agree that dedicated ablation studies would more directly isolate the effects of the ConvRes residual blocks and the enlarged output feature map. The manuscript demonstrates overall gains via end-to-end comparisons against published baselines on four datasets, but these do not decompose the individual contributions. We will add controlled ablation experiments (AVDNet variants with and without each component) to the revised Experiments section. revision: yes
-
Referee: Experiments section: the manuscript does not state that the cited state-of-the-art baselines were re-trained under identical optimizer, augmentation, learning-rate schedule and loss weighting as AVDNet. Observed deltas could therefore reflect implementation differences rather than the proposed components.
Authors: Baseline numbers are taken from the original publications, as is conventional when full re-implementation details are unavailable. Our own training protocol for AVDNet is described in detail. We recognize that this leaves room for implementation variance and will re-train the primary baselines (e.g., YOLO, SSD variants) under identical settings for the revision to enable a fairer head-to-head comparison. revision: yes
-
Referee: Dataset section: the newly introduced ABD set contains only 79 images. When results are reported on the combined (VEDAI+DLR-3K+DOTA+ABD) collection, the small size of ABD limits the strength of any generalization claim.
Authors: ABD (79 images, 1396 objects) was introduced as a modest supplementary set to increase scene diversity rather than as a large-scale benchmark. Individual results on VEDAI, DLR-3K and DOTA are also reported separately. We will revise the text to explicitly frame ABD as an auxiliary validation set and temper any generalization language regarding the combined collection. revision: partial
Circularity Check
No circularity: experimental claims rest on direct dataset comparisons without internal reductions
full rationale
The manuscript proposes AVDNet with ConvRes blocks and an enlarged feature map, then reports mAP and complexity numbers on VEDAI/DLR-3K/DOTA/ABD. No equations, fitted parameters, or predictions appear; the central claim is an empirical performance delta against prior detectors. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the architecture. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- network scale factors and block placement
axioms (1)
- domain assumption Residual connections alleviate vanishing features for small objects in deeper layers
invented entities (2)
-
ConvRes residual blocks
no independent evidence
-
RFAV visualization technique
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Convolutional Layer: Let $I_C(a, b)$ be an input image of size $M \times M$, $a \in [1, M]$, $b \in [1, M]$, having $C$ channels and $f(\cdot)$ is the filter with a kernel size $h \times h$. The response of the convolutional layer (conv) is computed by the following equation: $F^d = \left. \sum_{j=1}^{C} f_k^{(h)} * I_j^n + b_k \right|_{k=1}^{d}$ (1) where $b_k$ is the bias, $n \in [1, M]$, and $d$ is the filter depth. ...
-
[2]
ConvRes Blocks: The response of a ConvRes block consisting of three conv layers is computed using the following equation: $F^d_{\mathrm{ConvRes}} = F^d_l(a, b) + F^d_{l-2}(a, b)$ (3) where $l$ is the current conv feature layer. These ConvRes residual features are studied at three different scales in the AVDNet. The 1×1 conv response along with the leaky ReLU introduces an...
-
[3]
Recurrent-Feature Aware Visualization: In Fig. 2, the composite visual representation of the multiple feature maps generated at the end of a conv operation is shown. For $d$ feature maps in conv layer $l$, the RFAV representation is computed using the following equation: $\mathrm{RFAV}_l(a, b) = \arg\max_z \left( H_l^{(a,b)}(z) \right)$; $z \in [0, 255]$ (4) where $\arg\max(\cdot)$ collects ...
-
[4]
Feature Degradation Problem: Usually, the initial layers have detailed information as compared to the features at the deeper layers. These detailed features are very useful in the detection of small and dense objects. In order to preserve the small-sized object features, we designed residual feature blocks at multiple scales in the AVDNet. These residual...
-
[5]
Effect of Final Feature Map Resolution: The lower pixels-per-object values of the smaller objects cause the features to vanish in the deeper networks. This is in contrast to the features of bigger objects with higher pixels-per-object values, which are clearly detected in the deeper CNN layers. For example, if the resolution of an image is 1024 × 1024, ...
-
[6]
Higher Pixels-Per-Object Values: We have made another observation that the input layer size influences the network’s capability to learn the features for the small-size objects. This point was also reiterated by Lin et al. [22] in RetinaNet where they used 600-pixel and 800-pixel image scale as input to the network to improve the detection performance. In...
-
[7]
VEDAI: VEDAI [27] data set contains aerial images captured from various scenarios for vehicle detection. In our experiments, we have trained our proposed AVDNet for 11 vehicle categories. The details of all the data sets (number of images, objects per class, etc.) are given in Table I (Summarization of the Evaluated Data Sets)
-
[8]
DLR-3K: DLR-3K [1] is mainly comprised of scenes from urban and residential areas. For our experiments, we have divided each image (total of 20 images) into 16 parts to generate 320 images. We have manually annotated all the images in DLR-3K and generated 8401 horizontally aligned bounding boxes for all the objects. Finally, we selected 262 images with ...
-
[9]
DOTA: DOTA [28] introduced a large-scale data set consisting of 2806 aerial images. In our experiments, we have represented these objects through four categories: car, heavy vehicle, plane, and boat. Moreover, we manually annotated all the images in DOTA and generated 55 235 horizontally aligned bounding boxes as ground truth
-
[10]
Airborne Data Set (ABD) Data Set: We collected 79 new aerial images from online sources and generated a new data set named ABD by annotating 1396 objects for our experiments. The objects were annotated with four different classes: car, heavy vehicle, plane, and boat
-
[11]
Complete Data Set: For more comprehensive performance analysis of the proposed and existing object detectors in aerial scenes, we generated a large data set by combining VEDAI, DLR-3K, DOTA, and ABD data sets. The complete data set is categorized into four classes similar to the DOTA and ABD data set. The summary description of all the data sets is give...
-
[12]
Implementation Details: The entire method is implemented in darknet. The detection results of the AVDNet depend on various parameters, such as intersection over union (IoU) thresholds and number of anchor boxes. The threshold is the minimum object confidence score for which the network will detect an object. The object and class confidence values are com...
-
[13]
Training Configuration: Training is done over a Titan Xp GPU system with stochastic gradient descent optimizer and minibatch size = 4. The weight decay and momentum parameters are set to 0.0005 and 0.9, respectively. The training loss is calculated by taking the sum of square error from the final layer of the network, as given in [20]. We train our model wi...
-
[14]
Model Training: We divide VEDAI, DOTA, Complete data set into train and test set with a ratio of ∼[90:10]. The DLR-3K data set is divided with a ratio of ∼[80:20]. The AVDNet is trained over each data set without using any pretrained weights. The AVDNet detector is trained for ∼30k iterations over VEDAI, DOTA, Complete data set, and ∼15k iterations over...
-
[15]
Quantitative Results: The performance measures of the proposed AVDNet and other state-of-the-art approaches for vehicle detection in VEDAI, DLR-3K, DOTA, and Complete data set is given in Table II. We compare different methods in terms of mAP, which corresponds to the average of the maximum precisions at different recall values. To ensure fair comparis...
-
[16]
Qualitative Results: We show the qualitative results of our approach to different challenging scenarios in Fig. 6. The detection responses from the original images are cropped out for appropriate visual representation. The AVDNet is able to detect vehicles, which are partially occluded by overhead building or trees, as shown in the first row in Fig. 6. Si...
-
[17]
Complexity Analysis: The computation and space complexity of the proposed method and existing state-of-the-art techniques is given in Table III. We can see that the proposed method uses approximately 1/5, 2/9, 5/14 times smaller number of parameters as compared to YOLO (v2 and v3), Faster R-CNN, and RetinaNet, respectively. Similarly, the proposed AV...
-
[18]
Fast multiclass vehicle detection on aerial images,
K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9, pp. 1938–1942, Sep. 2015
-
[19]
Detection of cars in high-resolution aerial images of complex urban environments,
M. ElMikaty and T. Stathaki, “Detection of cars in high-resolution aerial images of complex urban environments,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 10, pp. 5913–5924, Oct. 2017
-
[20]
An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery,
Y. Xu, G. Yu, X. Wu, Y. Wang, and Y. Ma, “An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 7, pp. 1845–1856, Jul. 2017
-
[21]
Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning
H. Zhou, L. Wei, C. P. Lim, D. Creighton, and S. Nahavandi, “Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 12, pp. 7074–7085, Dec. 2018
-
[22]
X. Wu, D. Hong, J. Tian, J. Chanussot, W. Li, and R. Tao, “ORSIm detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,” IEEE Trans. Geosci. Remote Sens., to be published. doi: 10.1109/TGRS.2019.2897139
-
[23]
SLIC superpixels compared to state-of-the-art superpixel methods,
R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012
-
[24]
VCells: Simple and efficient superpixels using edge-weighted centroidal Voronoi tessellations,
J. Wang and X. Wang, “VCells: Simple and efficient superpixels using edge-weighted centroidal Voronoi tessellations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1241–1247, Jun. 2012
-
[25]
W. Zhang, X. Sun, K. Fu, C. Wang, and H. Wang, “Object detection in high-resolution remote sensing images using rotation invariant parts based model,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 1, pp. 74–78, Jan. 2014
-
[26]
A SIFT-SVM method for detecting cars in UAV images,
T. Moranduzzo and F. Melgani, “A SIFT-SVM method for detecting cars in UAV images,” in Proc. IEEE IGARSS, Jul. 2012, pp. 6868–6871
-
[27]
Z. Chen et al., “Vehicle detection in high-resolution aerial images based on fast sparse representation classification and multiorder feature,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 8, pp. 2296–2309, Aug. 2016
-
[28]
Vehicle detection in high-resolution aerial images via sparse representation and superpixels,
Z. Chen et al., “Vehicle detection in high-resolution aerial images via sparse representation and superpixels,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 1, pp. 103–116, Jan. 2016
-
[29]
Y. Yu, H. Guan, and Z. Ji, “Rotation-invariant object detection in high-resolution satellite imagery using superpixel-based deep Hough forests,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 2183–2187, Nov. 2015
-
[30]
Deep multi-modal vehicle detection in aerial ISR imagery,
W. Sakla, G. Konjevod, and T. N. Mundhenk, “Deep multi-modal vehicle detection in aerial ISR imagery,” in Proc. IEEE WACV, Mar. 2017, pp. 916–923
-
[31]
Fast deep vehicle detection in aerial images,
L. W. Sommer, T. Schuchert, and J. Beyerer, “Fast deep vehicle detection in aerial images,” in Proc. IEEE WACV, Mar. 2017, pp. 311–319
-
[32]
Semantic labeling based vehicle detection in aerial imagery,
K. Nie, L. Sommer, A. Schumann, and J. Beyerer, “Semantic labeling based vehicle detection in aerial imagery,” in Proc. IEEE WACV, Mar. 2018, pp. 626–634
-
[33]
Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017
-
[34]
Accurate object localization in remote sensing images based on convolutional neural networks,
Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, May 2017
-
[35]
Faster R-CNN: Towards real-time object detection with region proposal networks,
S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017
-
[36]
You only look once: Unified, real-time object detection,
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. CVPR, 2016, pp. 779–788
-
[37]
YOLO9000: Better, faster, stronger,
J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR, 2017, pp. 7263–7271
-
[38]
YOLOv3: An Incremental Improvement
J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767
-
[39]
Focal loss for dense object detection,
T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. ICCV, 2017, pp. 2980–2988
-
[40]
Rotation-insensitive and context-augmented object detection in remote sensing images,
K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2337–2348, Apr. 2018
-
[41]
G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016
-
[42]
G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 265–278, Jan. 2019
-
[43]
Efficient saliency-based object detection in remote sensing images using deep belief networks,
W. Diao, X. Sun, X. Zheng, F. Dou, H. Wang, and K. Fu, “Efficient saliency-based object detection in remote sensing images using deep belief networks,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 137–141, Feb. 2016
-
[44]
Vehicle detection in aerial imagery: A small target detection benchmark,
S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016
-
[45]
DOTA: A large-scale dataset for object detection in aerial images,
G. S. Xia et al., “DOTA: A large-scale dataset for object detection in aerial images,” in Proc. CVPR, 2018, pp. 3974–3983