
arXiv: 2510.02213 · v2 · submitted 2025-10-02 · 💻 cs.CV

Getting the Numbers Right - Modelling Multi-Class Object Counting in Dense and Varied Scenes

Pith reviewed 2026-05-18 10:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-class density estimation · object counting · vision transformer · VisDrone · iSAID · crowd counting · auxiliary segmentation

The pith

A vision transformer backbone with a training-only category focus module enables accurate multi-class object counting in both dense and sparse scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a density estimation method that counts objects from multiple classes in scenes ranging from nearly empty to heavily crowded and occluded. It pairs a pyramid vision transformer for multi-scale feature extraction with a CNN decoder, and adds an auxiliary segmentation task, used only during training via the Category Focus Module, to reduce class confusion. On VisDrone and iSAID, the reported MAE drops by 33%, 43%, and 64% relative to earlier multi-class density methods, and the model beats YOLO11, most clearly in the busiest images. A reader would care because this could support reliable automated counting in real settings such as traffic, agriculture, or public spaces without switching between separate models for different crowd levels.

Core claim

The paper claims that a Twins-SVT pyramid vision transformer backbone combined with a multiscale CNN decoder and a Category Focus Module for an auxiliary segmentation task applied only at training time produces multi-class density maps that remain accurate across wide density variations, delivering 33 percent, 43 percent, and 64 percent reductions in MAE on the VisDrone and iSAID test sets while outperforming YOLO11 by an order of magnitude in the most crowded samples.
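To make the headline metric concrete, here is a minimal sketch of how a mean absolute error (MAE) and a relative MAE reduction are computed from predicted and true counts. The function names and all numbers are illustrative assumptions, not values or code from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction(baseline_mae: float, new_mae: float) -> float:
    """Fractional drop in MAE relative to a baseline (0.33 means 33 percent)."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - new_mae) / baseline_mae


# Illustrative counts only (hypothetical, not taken from the paper).
true_counts = [10, 40, 200]
baseline_pred = [16, 31, 185]  # absolute errors 6, 9, 15
new_pred = [12, 44, 191]       # absolute errors 2, 4, 9

baseline_mae = mean_absolute_error(baseline_pred, true_counts)
new_mae = mean_absolute_error(new_pred, true_counts)
print(baseline_mae, new_mae, relative_reduction(baseline_mae, new_mae))
```

A relative reduction depends on the baseline's absolute MAE, which is why the referee report asks for the absolute numbers behind each percentage.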

What carries the argument

The Category Focus Module, which applies an auxiliary segmentation task during training to suppress inter-category interference in the density estimation head without using the task or its constraints at inference time.
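The training-only pattern described here can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `ToyCounter`, its weights, the squared-error mask term, and `aux_weight` are all hypothetical stand-ins for a shared backbone, a density head, and an auxiliary segmentation head.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyCounter:
    """Toy stand-in for a shared backbone with two heads (hypothetical)."""
    density_weight: float = 0.5
    mask_weight: float = 2.0

    def features(self, pixels: List[float]) -> List[float]:
        # Shared representation used by both heads.
        return [max(0.0, p) for p in pixels]

    def forward(self, pixels: List[float], training: bool) -> Dict[str, List[float]]:
        feats = self.features(pixels)
        outputs = {"density": [self.density_weight * f for f in feats]}
        if training:
            # The auxiliary segmentation output exists only while training.
            outputs["mask"] = [min(1.0, self.mask_weight * f) for f in feats]
        return outputs


def training_loss(outputs, density_gt, mask_gt, aux_weight=0.1):
    """Density loss plus a weighted auxiliary term (aux_weight is assumed)."""
    density_loss = sum((p - g) ** 2 for p, g in zip(outputs["density"], density_gt))
    mask_loss = sum((p - g) ** 2 for p, g in zip(outputs["mask"], mask_gt))
    return density_loss + aux_weight * mask_loss


model = ToyCounter()
pixels = [0.0, 1.0, 2.0]
train_out = model.forward(pixels, training=True)
infer_out = model.forward(pixels, training=False)
print(sorted(train_out))          # both heads are available for the loss
print(sorted(infer_out))          # only the density head remains
print(sum(infer_out["density"]))  # predicted count from the density output
```

In a real network both loss terms would backpropagate into the shared features, so the auxiliary task still shapes the backbone even though its head is dropped at inference.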

If this is right

  • Multi-class density estimation no longer needs to trade off performance between dense and sparse scenes.
  • The method bridges the gap where prior density estimators degrade in low-density images and detectors like YOLO11 lose accuracy in high-density images.
  • Practical counting systems can use a single model for varied real-world conditions without auxiliary tasks at test time.
  • Error reductions of one-third to two-thirds suggest measurable gains in applications that rely on accurate class-specific counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The non-exclusive class modeling could transfer to other vision tasks that must handle overlapping or ambiguous categories.
  • Combining the approach with temporal information from video could extend robust counting to dynamic scenes.
  • Further scaling the transformer backbone might yield additional gains on larger or more varied datasets.

Load-bearing premise

The auxiliary segmentation task can improve the density estimation head during training without introducing biases that would require applying the same task or its assumptions at inference time.

What would settle it

Evaluating the model on a held-out dataset containing object categories or density extremes absent from VisDrone and iSAID would show whether the reported error reductions and cross-density robustness persist.

Figures

Figures reproduced from arXiv: 2510.02213 by Georgios Leontidis, James M. Brown, Jonathan Cox, Marc Hanheide, Petra Bosilj, Villanelle O'Reilly.

Figure 1
Multipurpose Multi-class Density Estimation. Testing results from our multicategory crowd counting method applied to the Hicks et al. [12], VisDrone-DET [34] and iSAID [29] datasets.
Figure 2
Class Distribution. In multi-class density estimation, each class represents a distinct counting task, so an imbalance in how often classes appear can strongly influence gradients, if most of the counting tasks are optimal at zero for a given sample. The plot illustrates the number of different classes present in each image. VisDrone-DET (both the 8- and 10-class versions) and iSAID images typically con…
Figure 3
Our Model Architecture. Within the Multiscale Aware Module (MAM), a concept from Yu and Hu [32], although used differently here, the first column of convolutions is followed immediately by a column of batch norm and ReLU activations. The Category Focus Module (CFM) is an extension of the MAM with one additional Conv → Conv_dilated row with a dilation of 4.
Figure 4
Model Heads. The two output heads of the model.

$$L'_2 = \sum_{c=1}^{C} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( Q'^{\,\mathrm{pred}}_{c,i,j} - Q'^{\,\mathrm{gt}}_{c,i,j} \right)^2 \quad (4)$$

where $w_r$ is a weighting between the two terms, and $Q'_c$, the inverse of $Q_c$, is scaled to have an equal mean value, so the terms can be proportionally weighted. The segmentation head employs a per-category cross-entropy loss:

$$L_{\mathrm{mask}} = \frac{1}{C} \sum_{c=1}^{C} L_{\mathrm{CE}}\left( Q^{\mathrm{pred}}_c, Q^{\mathrm{gt}}_c \right) \quad (5)$$

The losses are c…
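The two losses quoted with Figure 4, Eq. (4) and Eq. (5), can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: the nested-list layout (C categories of m x n maps) and the choice of mean per-pixel binary cross-entropy for the cross-entropy term are assumptions, since the excerpt does not spell out that term's exact form.

```python
import math
from typing import List

Map = List[List[float]]  # one m x n map
Stack = List[Map]        # C maps, one per category


def squared_error_loss(pred: Stack, gt: Stack) -> float:
    """Sum of squared differences over categories c and pixels (i, j), as in Eq. (4)."""
    return sum(
        (p - g) ** 2
        for pred_map, gt_map in zip(pred, gt)
        for pred_row, gt_row in zip(pred_map, gt_map)
        for p, g in zip(pred_row, gt_row)
    )


def binary_cross_entropy(pred_map: Map, gt_map: Map, eps: float = 1e-7) -> float:
    """Mean per-pixel binary cross-entropy for one category (assumed form)."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred_map, gt_map):
        for p, g in zip(pred_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
            total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
            count += 1
    return total / count


def mask_loss(pred: Stack, gt: Stack) -> float:
    """Average of the per-category cross-entropy over C categories, as in Eq. (5)."""
    return sum(binary_cross_entropy(p, g) for p, g in zip(pred, gt)) / len(pred)


# Hypothetical example with C = 2 categories and 1 x 2 maps.
pred = [[[0.5, 0.5]], [[0.5, 0.5]]]
gt = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(squared_error_loss(pred, gt), mask_loss(pred, gt))
```

An uninformative prediction of 0.5 everywhere gives a cross-entropy of ln 2 per pixel, which is a handy sanity check for the averaging.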
Original abstract

Density map estimation enables accurate object counting in heavily occluded, and densely packed scenes where detection-based counting fails. In multi-class density estimation, class awareness can be introduced by modelling classes non-exclusively, better reflecting crowded and visually ambiguous contexts. However, existing multi-class density estimators often degrade in less-dense scenes, while state-of-the-art detectors still struggle in the most congested settings. To bridge this gap, we propose the first vision-transformer-based approach to multi-class density estimation. Our model combines a Twins-SVT pyramid vision transformer backbone with a multiscale CNN decoder that leverages hierarchical features for robust counting across a wide range of densities. Further to that, the method adds an auxiliary segmentation task with the Category Focus Module to suppress inter-category interference at training time. The module improves the density estimation head without the need for constraining assumptions added by the application of the auxiliary task at inference time, as required in previous methods. Training and evaluation on the VisDrone and iSAID benchmarks demonstrates a leap in performance versus the previous state-of-the-art multi-class density estimation methods, attaining a 33%, 43%, and 64% reduction to MAE in testing evaluation. The method outperforms YOLO11 in less busy scenes, exceeding it by an order of magnitude in the most crowded testing samples. Code, and trained weights available at https://github.com/LCAS/gnr_mcdest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes the first vision-transformer-based multi-class density estimation approach for object counting. It combines a Twins-SVT pyramid backbone with a multiscale CNN decoder and introduces a Category Focus Module that applies an auxiliary segmentation task solely during training to suppress inter-category interference. The method is evaluated on VisDrone and iSAID, claiming 33%, 43%, and 64% MAE reductions versus prior multi-class density estimators, plus outperformance of YOLO11 (by an order of magnitude in the densest samples) without imposing inference-time constraints from the auxiliary task. Public code and weights are released.

Significance. If the central performance claims hold after verification of the training-only decoupling, the work would meaningfully bridge density estimation (strong in dense/occluded scenes) and detection (strong in sparse scenes) for multi-class counting across density regimes. The public code release and trained weights are a clear strength supporting reproducibility and follow-up work.

major comments (3)
  1. [§4] §4 (Experiments) and Table 2: The 33/43/64% MAE reductions versus prior multi-class density estimators are reported without standard deviations across runs, without p-values, and without an explicit list of the exact prior methods and their reported numbers; this makes the 'leap in performance' claim difficult to verify as load-bearing for the central contribution.
  2. [§3.2] §3.2 (Category Focus Module): The assertion that the auxiliary segmentation loss improves the density head exclusively at training time (with no residual coupling via shared backbone features or loss weighting) is not supported by an ablation that evaluates the density head both with and without the module at inference; without this isolation, the comparison to YOLO11 in crowded samples and the claim of avoiding prior methods' inference constraints remain insecure.
  3. [§4.3] §4.3 (Comparison with detectors): The order-of-magnitude MAE improvement over YOLO11 in the most crowded test samples is presented without defining the density threshold used to select those samples or reporting the number of such samples; this detail is required to assess whether the result generalizes or is driven by a small subset.
minor comments (3)
  1. [Abstract] The abstract states '33%, 43%, and 64% reduction to MAE' but does not name the three prior methods or the corresponding absolute MAE values; adding these would improve clarity.
  2. [Figure 3] Figure 3 (architecture diagram) would benefit from an explicit inference-time path annotation showing that the Category Focus Module is removed, to visually support the decoupling claim.
  3. [§3.1] The Twins-SVT backbone citation is present but the exact variant (e.g., Twins-SVT-B) and input resolution used should be stated in §3.1 for reproducibility.
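Major comment 1 asks for standard deviations across runs. A minimal sketch of that reporting, using only the Python standard library, could look like the following; the per-seed MAE values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev

# Hypothetical per-seed test MAE values, for illustration only.
mae_by_seed = {0: 5.2, 1: 4.8, 2: 5.0}

values = list(mae_by_seed.values())
mae_mean = mean(values)
mae_std = stdev(values)  # sample standard deviation (n - 1 denominator)

print(f"MAE = {mae_mean:.2f} ± {mae_std:.2f} over {len(values)} seeds")
```

With only a few seeds the standard deviation is itself a rough estimate, so reporting the number of runs alongside it matters.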

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve clarity and verifiability of the results.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and Table 2: The 33/43/64% MAE reductions versus prior multi-class density estimators are reported without standard deviations across runs, without p-values, and without an explicit list of the exact prior methods and their reported numbers; this makes the 'leap in performance' claim difficult to verify as load-bearing for the central contribution.

    Authors: We agree that additional statistical detail would strengthen the presentation. In the revised manuscript we will expand Table 2 to list the exact MAE values reported by each prior multi-class density estimator (with citations), compute the precise percentage reductions from those numbers, and add results from multiple independent training runs (different random seeds) reporting mean MAE together with standard deviations. We will also include paired statistical significance tests and the corresponding p-values for the main comparisons.
    revision: yes

  2. Referee: [§3.2] §3.2 (Category Focus Module): The assertion that the auxiliary segmentation loss improves the density head exclusively at training time (with no residual coupling via shared backbone features or loss weighting) is not supported by an ablation that evaluates the density head both with and without the module at inference; without this isolation, the comparison to YOLO11 in crowded samples and the claim of avoiding prior methods' inference constraints remain insecure.

    Authors: The Category Focus Module applies an auxiliary segmentation loss only during training; at inference the module is completely removed and the density head operates without any segmentation output or extra constraints. To make this isolation explicit we will add an ablation (new Table or figure in §3.2 and §4) that compares density-estimation performance of models trained with versus without the Category Focus Module. Because the module is discarded after training, no inference-time ablation of the module itself is possible or necessary; the new training-only ablation directly quantifies the benefit to the density head while confirming zero inference overhead.
    revision: yes

  3. Referee: [§4.3] §4.3 (Comparison with detectors): The order-of-magnitude MAE improvement over YOLO11 in the most crowded test samples is presented without defining the density threshold used to select those samples or reporting the number of such samples; this detail is required to assess whether the result generalizes or is driven by a small subset.

    Authors: We accept that the selection criterion must be stated explicitly. In the revised §4.3 we will define the density threshold (e.g., images whose object count exceeds the 90th percentile of the test-set distribution or a concrete per-pixel density value) used to isolate the most crowded samples and will report the exact number of test images satisfying the criterion. This information will allow readers to judge the scope of the reported improvement.
    revision: yes
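The percentile-based selection mentioned in the third response could be made explicit along these lines. This is a sketch under assumptions: `crowded_subset` is a hypothetical helper, the counts are made up, and the "inclusive" quantile method is one reasonable choice that the paper does not specify.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple


def crowded_subset(counts: Sequence[float], percentile: int = 90) -> Tuple[float, List[int]]:
    """Return the count threshold and the indices of images above it."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(..., n=100) yields 99 cut points; entry k-1 is the k-th percentile.
    cut_points = quantiles(counts, n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    selected = [i for i, c in enumerate(counts) if c > threshold]
    return threshold, selected


# Hypothetical per-image object counts for a 20-image test set.
counts = list(range(1, 21))
threshold, selected = crowded_subset(counts)
print(f"threshold={threshold:.1f}, crowded images={len(selected)}")
```

Reporting both the threshold and `len(selected)` addresses the referee's concern about whether the order-of-magnitude gap rests on a handful of images.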

Circularity Check

0 steps flagged

No significant circularity in empirical multi-class density estimation model

full rationale

The paper proposes an empirical architecture combining a Twins-SVT vision transformer backbone with a multiscale CNN decoder and a training-only auxiliary segmentation task via the Category Focus Module. Performance claims consist of reported MAE reductions (33/43/64%) on VisDrone and iSAID benchmarks plus comparisons to prior density estimators and YOLO11; these are direct experimental outcomes rather than derivations. No equations, first-principles results, or predictions appear that reduce to fitted inputs by construction. The auxiliary task is explicitly described as improving the density head at training time without inference constraints, presented as a design distinction from prior work rather than a self-referential reduction. Public code and weights further enable external verification. The derivation chain is self-contained as standard model design plus benchmark evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard density-map counting assumptions plus a newly introduced training module whose benefit is demonstrated only empirically on two benchmarks.

free parameters (1)
  • Training hyperparameters and loss weights
    Typical deep learning model choices not detailed in abstract but required for reproduction.
axioms (1)
  • Domain assumption: density map summation accurately yields object counts even under heavy occlusion and class ambiguity
    Foundational premise stated in the abstract for why density estimation is used instead of detection.
invented entities (1)
  • Category Focus Module (no independent evidence)
    purpose: Suppress inter-category interference during training only
    New component introduced to improve the density head without inference-time constraints.
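The axiom in the ledger, that summing a density map yields a count, can be illustrated with a small sketch. The map values and class names are hypothetical; in a multi-class estimator there is one map per class, and under non-exclusive modelling the same pixel may carry mass in several maps.

```python
from typing import Dict, List

DensityMap = List[List[float]]


def counts_from_density(maps: Dict[str, DensityMap]) -> Dict[str, float]:
    """Sum each class's density map to get that class's estimated count."""
    return {name: sum(sum(row) for row in dmap) for name, dmap in maps.items()}


# Hypothetical 2 x 3 maps; each object contributes a total mass of about 1.
maps = {
    "car": [[0.5, 0.5, 0.0], [0.0, 0.25, 0.75]],
    "person": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
}
print(counts_from_density(maps))
```

The all-zero "person" map shows the sparse case: a class absent from the image should sum to zero, which is where the report says earlier density estimators tend to degrade.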



Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

  1. [1]

    Breeze, Alison P

    Tom D. Breeze, Alison P. Bailey, Kelvin G. Balcombe, Tom Brereton, Richard Comont, Mike Edwards, Michael P. Garratt, Martin Harvey, Cathy Hawes, Nick Isaac, Mark Jitlal, Catherine M. Jones, William E. Kunin, Paul Lee, Roger K. A. Morris, Andy Musgrove, Rory S. O’Connor, Jodey Peyton, Simon G. Potts, Stuart P. M. Roberts, David B. Roy, Helen E. Roy, Cuon...

  2. [2]

    Scale aggregation network for accurate and efficient crowd counting

    Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

  3. [3]

    Twins: Revisiting the design of spatial attention in vision transformers

    Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in neural information processing systems, 34:9355–9366, 2021.

  4. [4]

    Dopnet: Dense object prediction network for multiclass object counting and localization in remote sensing images

    Mingpeng Cui, Guanchen Ding, Daiqin Yang, and Zhenzhong Chen. Dopnet: Dense object prediction network for multiclass object counting and localization in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024.

  5. [5]

    Cctwins: A weakly supervised transformer-based crowd counting method with adaptive scene consistency attention

    Li Dong, Haijun Zhang, Dongliang Zhou, Jianyang Shi, and Jianghong Ma. Cctwins: A weakly supervised transformer-based crowd counting method with adaptive scene consistency attention. IEEE Transactions on Consumer Electronics, 70(1):22–35, 2024.

  6. [6]

    Centernet: Keypoint triplets for object detection

    Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

  7. [7]

    Counting dense object of multiple types based on feature enhancement

    Qiyan Fu, Weidong Min, Weixiang Sheng, and Chunjiang Peng. Counting dense object of multiple types based on feature enhancement. Frontiers in Neurorobotics, 18:1383943,

  8. [8]

    Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method

    Guangshuai Gao, Qingjie Liu, and Yunhong Wang. Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method. IEEE Transactions on Geoscience and Remote Sensing, 59(5):3642–3655, 2021.

  9. [9]

    Nwpu-moc: A benchmark for fine-grained multicategory object counting in aerial images

    Junyu Gao, Liangliang Zhao, and Xuelong Li. Nwpu-moc: A benchmark for fine-grained multicategory object counting in aerial images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–14, 2024.

  10. [10]

    Deep regression versus detection for counting in robotic phenotyping

    Adrian Salazar Gomez, Erchan Aptoula, Simon Parsons, and Petra Bosilj. Deep regression versus detection for counting in robotic phenotyping. IEEE Robotics and Automation Letters, 6(2):2902–2907, 2021.

  11. [11]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

  12. [12]

    Deep learning object detection to estimate the nectar sugar mass of flowering vegetation

    Damien Hicks, Mathilde Baude, Christoph Kratz, Pierre Ouvrard, and Graham Stone. Deep learning object detection to estimate the nectar sugar mass of flowering vegetation. Ecological Solutions and Evidence, 2(3):e12099, 2021.

  13. [13]

    Ultralytics yolo11

    Glenn Jocher and Jing Qiu. Ultralytics yolo11, 2024.

  14. [14]

    Yolov11: An overview of the key architectural enhancements

    Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements, 2024.

  15. [15]

    Deep learning and transfer learning for automatic cell counting in microscope images of human cancer cell lines

    Falko Lavitt, Demi J. Rijlaarsdam, Dennet van der Linden, Ewelina Weglarz-Tomczak, and Jakub M. Tomczak. Deep learning and transfer learning for automatic cell counting in microscope images of human cancer cell lines. Applied Sciences, 11(11), 2021.

  16. [16]

    Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes

    Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

  17. [17]

    Transcrowd: weakly-supervised crowd counting with transformers

    Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou, and Xiang Bai. Transcrowd: weakly-supervised crowd counting with transformers. Science China Information Sciences, 65(6):160104, 2022.

  18. [18]

    Semi-supervised counting via pixel-by-pixel density distribution modeling

    Hui Lin, Zhiheng Ma, Rongrong Ji, Yaowei Wang, Zhou Su, Xiaopeng Hong, and Deyu Meng. Semi-supervised counting via pixel-by-pixel density distribution modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3625–3638, 2025.

  19. [19]

    Context-aware crowd counting

    Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  20. [20]

    Bayesian loss for crowd count estimation with point supervision

    Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

  21. [21]

    Class-aware object counting

    Andreas Michel, Wolfgang Gross, Fabian Schenkel, and Wolfgang Middelmann. Class-aware object counting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 469–478, 2022.

  22. [22]

    Tree extraction from multi-scale uav images using mask r-cnn with fpn

    Nuri Erkin Ocer, Gordana Kaplan, Firat Erdem, Dilek Kucuk Matci, and Ugur Avdan. Tree extraction from multi-scale uav images using mask r-cnn with fpn. Remote sensing letters, 11(9):847–856, 2020.

  23. [23]

    Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution

    Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10213–10224, 2021.

  24. [24]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.

  25. [25]

    A convolutional neural-network-based pedestrian counting model for various crowded scenes

    Jie Shen, Xin Xiong, Zhiyuan Xue, and Yinglong Bian. A convolutional neural-network-based pedestrian counting model for various crowded scenes. Computer-Aided Civil and Infrastructure Engineering, 34(10):897–914, 2019.

  26. [26]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.

  27. [27]

    Cctrans: Simplifying and improving crowd counting with transformer

    Ye Tian, Xiangxiang Chu, and Hongpeng Wang. Cctrans: Simplifying and improving crowd counting with transformer, 2021.

  28. [28]

    Greenhouse gas reporting: conversion factors 2025

    UK Gov’t Department for Energy Security and Net Zero. Greenhouse gas reporting: conversion factors 2025, 2025. [Online; accessed 07-September-2025].

  29. [29]

    isaid: A large-scale dataset for instance segmentation in aerial images

    Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 28–37, 2019.

  30. [30]

    Dota: A large-scale dataset for object detection in aerial images

    Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

  31. [31]

    Dilated-scale-aware category-attention convnet for multi-class object counting

    Wei Xu, Dingkang Liang, Yixiao Zheng, Jiahao Xie, and Zhanyu Ma. Dilated-scale-aware category-attention convnet for multi-class object counting. IEEE Signal Processing Letters, 28:1570–1574, 2021.

  32. [32]

    Multiscale regional calibration network for crowd counting

    Jiamao Yu and Hexuan Hu. Multiscale regional calibration network for crowd counting. Scientific Reports, 15(1):2866,

  33. [33]

    Single-image crowd counting via multi-column convolutional neural network

    Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  34. [34]

    Detection and tracking meet drones challenge

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2021.

  35. [35]

    Object detection in 20 years: A survey

    Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257–276, 2023.