Getting the Numbers Right - Modelling Multi-Class Object Counting in Dense and Varied Scenes

Georgios Leontidis; James M. Brown; Jonathan Cox; Marc Hanheide; Petra Bosilj; Villanelle O'Reilly

arxiv: 2510.02213 · v2 · submitted 2025-10-02 · 💻 cs.CV

Villanelle O'Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James M. Brown

Pith reviewed 2026-05-18 10:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-class density estimation · object counting · vision transformer · VisDrone · iSAID · crowd counting · auxiliary segmentation

The pith

A vision transformer backbone with a training-only category focus module enables accurate multi-class object counting in both dense and sparse scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a density estimation method that counts objects from multiple classes even when scenes range from nearly empty to heavily crowded and occluded. It pairs a pyramid vision transformer for multi-scale feature extraction with a CNN decoder, and adds an auxiliary segmentation task, used only during training via the Category Focus Module, to reduce class confusion. On VisDrone and iSAID, the method shows large drops in mean absolute error (MAE) versus earlier multi-class density methods and better results than YOLO11, especially in the busiest images. A reader would care because this could support reliable automated counting in real settings such as traffic, agriculture, or public spaces without switching between separate models for different crowd levels.

Core claim

The paper claims that a Twins-SVT pyramid vision transformer backbone, combined with a multiscale CNN decoder and a Category Focus Module that applies an auxiliary segmentation task at training time only, produces multi-class density maps that remain accurate across wide density variations. The reported gains are 33%, 43%, and 64% reductions in MAE relative to previous multi-class density estimators on the VisDrone and iSAID test sets, and an order-of-magnitude advantage over YOLO11 on the most crowded samples.
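
The percentages in the core claim are relative reductions in MAE. The following is a minimal sketch of that arithmetic, assuming a single class and per-image counts; the numbers are invented for illustration and are not the paper's results, and the function names are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Mean absolute difference between predicted and ground-truth counts."""
    if len(predicted) != len(actual):
        raise ValueError("count lists must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction(baseline_mae, new_mae):
    """Fractional drop in MAE relative to the baseline (0.33 means 33% lower)."""
    return (baseline_mae - new_mae) / baseline_mae


# Illustrative per-image counts for one class; NOT taken from the paper.
ground_truth = [12, 0, 47, 5]
baseline_pred = [9, 3, 38, 5]
improved_pred = [11, 1, 43, 5]

baseline_mae = mean_absolute_error(baseline_pred, ground_truth)
improved_mae = mean_absolute_error(improved_pred, ground_truth)
reduction = relative_reduction(baseline_mae, improved_mae)
print(f"baseline MAE={baseline_mae:.2f}, improved MAE={improved_mae:.2f}")
print(f"relative reduction={reduction:.0%}")
```

A relative figure alone hides the scale of the errors, which is why the absolute MAE values matter when judging such a claim.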

What carries the argument

The Category Focus Module, which applies an auxiliary segmentation task during training to suppress inter-category interference in the density estimation head without using the task or its constraints at inference time.
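
The training-only role described above can be pictured with a toy two-head model. This is a minimal sketch under loud assumptions: `TwoHeadCounter`, its scalar "weights", and the doubling "backbone" are hypothetical stand-ins, not the paper's Twins-SVT backbone or the actual wiring of the Category Focus Module. It shows only the pattern in which the auxiliary branch produces outputs (and hence a loss) during training and is skipped at inference.

```python
class TwoHeadCounter:
    """Toy model: shared features feed a density head and an auxiliary mask head."""

    def __init__(self, density_weight=0.5, mask_weight=1.5):
        # Hypothetical scalar "weights" standing in for learned layers.
        self.density_weight = density_weight
        self.mask_weight = mask_weight

    def shared_features(self, pixels):
        # Placeholder for the shared backbone and decoder.
        return [2.0 * value for value in pixels]

    def forward(self, pixels, training):
        features = self.shared_features(pixels)
        outputs = {"density": [self.density_weight * f for f in features]}
        if training:
            # Auxiliary branch: evaluated only when a segmentation loss is needed.
            outputs["mask_logits"] = [self.mask_weight * f for f in features]
        return outputs


model = TwoHeadCounter()
image = [0.0, 1.0, 3.0]
train_out = model.forward(image, training=True)
infer_out = model.forward(image, training=False)
print(sorted(train_out))          # both heads during training
print(sorted(infer_out))          # density head only at inference
print(sum(infer_out["density"]))  # predicted count = sum of the density map
```

The density output is identical in both modes; in a design like this, any benefit from the auxiliary branch would have to arrive through what the shared features learned during training.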

If this is right

Multi-class density estimation no longer needs to trade off performance between dense and sparse scenes.
The method bridges the gap where prior density estimators degrade in low-density images and detectors like YOLO11 lose accuracy in high-density images.
Practical counting systems can use a single model for varied real-world conditions without auxiliary tasks at test time.
Error reductions of one-third to two-thirds suggest measurable gains in applications that rely on accurate class-specific counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The non-exclusive class modeling could transfer to other vision tasks that must handle overlapping or ambiguous categories.
Combining the approach with temporal information from video could extend robust counting to dynamic scenes.
Further scaling the transformer backbone might yield additional gains on larger or more varied datasets.

Load-bearing premise

The auxiliary segmentation task can improve the density estimation head during training without introducing biases that would require applying the same task or its assumptions at inference time.

What would settle it

Evaluating the model on a held-out dataset containing object categories or density extremes absent from VisDrone and iSAID would show whether the reported error reductions and cross-density robustness persist.

Figures

Figures reproduced from arXiv: 2510.02213 by Georgios Leontidis, James M. Brown, Jonathan Cox, Marc Hanheide, Petra Bosilj, Villanelle O'Reilly.

**Figure 1.** Multipurpose Multi-class Density Estimation. Testing results from our multicategory crowd counting method applied to the Hicks et al. [12], VisDrone-DET [34] and iSAID [29] datasets.

**Figure 2.** Class Distribution. In multi-class density estimation, each class represents a distinct counting task, so an imbalance in how often classes appear can strongly influence gradients if most of the counting tasks are optimal at zero for a given sample. The plot illustrates the number of different classes present in each image. VisDrone-DET (both the 8- and 10-class versions) and iSAID images typically con…

**Figure 3.** Our Model Architecture. Within the Multiscale Aware Module (MAM), a concept from Yu and Hu [32], although used differently here, the first column of convolutions is followed immediately by a column of batch norm and ReLU activations. The Category Focus Module (CFM) is an extension of the MAM with one additional Conv → Conv_dilated row with a dilation of 4.

**Figure 4.** Model Heads. The two output heads of the model.

$$L'_2 = \sum_{c=1}^{C} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( Q'^{\,\mathrm{pred}}_{c,i,j} - Q'^{\,\mathrm{gt}}_{c,i,j} \right)^2 \qquad (4)$$

where $w_r$ is a weighting between the two terms, and $Q'_c$, the inverse of $Q_c$, is scaled to have an equal mean value so that the terms can be proportionally weighted. The segmentation head employs a per-category cross-entropy loss:

$$L_{\mathrm{mask}} = \frac{1}{C} \sum_{c=1}^{C} L_{\mathrm{CE}}\left( Q^{\mathrm{pred}}_c, Q^{\mathrm{gt}}_c \right) \qquad (5)$$

The losses are c…
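
The two loss terms quoted in the Figure 4 text, Eq. (4) and Eq. (5), can be written out directly. The sketch below works on plain nested lists shaped class × row × column. It makes one labelled assumption: the source does not define $L_{\mathrm{CE}}$, so it is taken here to be pixel-averaged binary cross-entropy on predicted probabilities. How the paper combines the terms is cut off in the source, so no combination is shown.

```python
import math


def squared_error_sum(pred, target):
    """Eq. (4) form: sum over classes c and pixels (i, j) of squared differences."""
    return sum(
        (p - t) ** 2
        for pred_c, target_c in zip(pred, target)
        for pred_row, target_row in zip(pred_c, target_c)
        for p, t in zip(pred_row, target_row)
    )


def binary_cross_entropy(prob_map, mask, eps=1e-7):
    """ASSUMED form of L_CE: pixel-averaged binary cross-entropy for one class."""
    total, n_pixels = 0.0, 0
    for prob_row, mask_row in zip(prob_map, mask):
        for p, y in zip(prob_row, mask_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            n_pixels += 1
    return total / n_pixels


def mask_loss(prob_maps, masks):
    """Eq. (5) form: average the per-class cross-entropy over the C classes."""
    per_class = [binary_cross_entropy(p, m) for p, m in zip(prob_maps, masks)]
    return sum(per_class) / len(per_class)


# Two classes on a 1x2 grid; values are illustrative only.
pred_maps = [[[0.5, 0.0]], [[1.0, 2.0]]]
gt_maps = [[[0.0, 0.0]], [[1.0, 0.0]]]
print(squared_error_sum(pred_maps, gt_maps))

uncertain_probs = [[[0.5, 0.5]], [[0.5, 0.5]]]
gt_masks = [[[1, 0]], [[0, 0]]]
print(mask_loss(uncertain_probs, gt_masks))  # ln 2 for a maximally uncertain mask
```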

Original abstract

Density map estimation enables accurate object counting in heavily occluded, and densely packed scenes where detection-based counting fails. In multi-class density estimation, class awareness can be introduced by modelling classes non-exclusively, better reflecting crowded and visually ambiguous contexts. However, existing multi-class density estimators often degrade in less-dense scenes, while state-of-the-art detectors still struggle in the most congested settings. To bridge this gap, we propose the first vision-transformer-based approach to multi-class density estimation. Our model combines a Twins-SVT pyramid vision transformer backbone with a multiscale CNN decoder that leverages hierarchical features for robust counting across a wide range of densities. Further to that, the method adds an auxiliary segmentation task with the Category Focus Module to suppress inter-category interference at training time. The module improves the density estimation head without the need for constraining assumptions added by the application of the auxiliary task at inference time, as required in previous methods. Training and evaluation on the VisDrone and iSAID benchmarks demonstrates a leap in performance versus the previous state-of-the-art multi-class density estimation methods, attaining a 33%, 43%, and 64% reduction to MAE in testing evaluation. The method outperforms YOLO11 in less busy scenes, exceeding it by an order of magnitude in the most crowded testing samples. Code, and trained weights available at https://github.com/LCAS/gnr_mcdest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes the first vision-transformer-based multi-class density estimation approach for object counting. It combines a Twins-SVT pyramid backbone with a multiscale CNN decoder and introduces a Category Focus Module that applies an auxiliary segmentation task solely during training to suppress inter-category interference. The method is evaluated on VisDrone and iSAID, claiming 33%, 43%, and 64% MAE reductions versus prior multi-class density estimators, plus outperformance of YOLO11 (by an order of magnitude in the densest samples) without imposing inference-time constraints from the auxiliary task. Public code and weights are released.

Significance. If the central performance claims hold after verification of the training-only decoupling, the work would meaningfully bridge density estimation (strong in dense/occluded scenes) and detection (strong in sparse scenes) for multi-class counting across density regimes. The public code release and trained weights are a clear strength supporting reproducibility and follow-up work.

major comments (3)

[§4] §4 (Experiments) and Table 2: The 33/43/64% MAE reductions versus prior multi-class density estimators are reported without standard deviations across runs, without p-values, and without an explicit list of the exact prior methods and their reported numbers; this makes the 'leap in performance' claim difficult to verify as load-bearing for the central contribution.
[§3.2] §3.2 (Category Focus Module): The assertion that the auxiliary segmentation loss improves the density head exclusively at training time (with no residual coupling via shared backbone features or loss weighting) is not supported by an ablation that evaluates the density head both with and without the module at inference; without this isolation, the comparison to YOLO11 in crowded samples and the claim of avoiding prior methods' inference constraints remain insecure.
[§4.3] §4.3 (Comparison with detectors): The order-of-magnitude MAE improvement over YOLO11 in the most crowded test samples is presented without defining the density threshold used to select those samples or reporting the number of such samples; this detail is required to assess whether the result generalizes or is driven by a small subset.
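
The first major comment asks for standard deviations across runs and for p-values. One way to produce both with the Python standard library is sketched below, assuming MAE values from several seeds and paired per-image absolute errors for two methods. The data are invented, the function names are hypothetical, and the exact sign-flip permutation test is one reasonable choice of paired test, not one the paper or the referee specifies.

```python
import itertools
import statistics


def summarise_runs(mae_per_seed):
    """Mean and sample standard deviation of MAE over independent training runs."""
    return statistics.mean(mae_per_seed), statistics.stdev(mae_per_seed)


def paired_sign_flip_p_value(errors_a, errors_b):
    """Exact two-sided sign-flip permutation test on paired per-image errors.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Illustrative numbers only; NOT results from the paper.
mean_mae, std_mae = summarise_runs([4.1, 3.9, 4.3])
print(f"MAE over seeds: {mean_mae:.2f} +/- {std_mae:.2f}")

baseline_errors = [5, 7, 6, 9, 8, 4]
proposed_errors = [3, 4, 5, 5, 6, 3]
p_value = paired_sign_flip_p_value(baseline_errors, proposed_errors)
print(f"p-value: {p_value}")
```

With six paired images the smallest attainable two-sided p-value is 2/64, which shows why the number of evaluated samples matters as much as the size of the gap.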

minor comments (3)

[Abstract] The abstract states '33%, 43%, and 64% reduction to MAE' but does not name the three prior methods or the corresponding absolute MAE values; adding these would improve clarity.
[Figure 3] Figure 3 (architecture diagram) would benefit from an explicit inference-time path annotation showing that the Category Focus Module is removed, to visually support the decoupling claim.
[§3.1] The Twins-SVT backbone citation is present but the exact variant (e.g., Twins-SVT-B) and input resolution used should be stated in §3.1 for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve clarity and verifiability of the results.

Point-by-point responses

Referee: [§4] §4 (Experiments) and Table 2: The 33/43/64% MAE reductions versus prior multi-class density estimators are reported without standard deviations across runs, without p-values, and without an explicit list of the exact prior methods and their reported numbers; this makes the 'leap in performance' claim difficult to verify as load-bearing for the central contribution.

Authors: We agree that additional statistical detail would strengthen the presentation. In the revised manuscript we will expand Table 2 to list the exact MAE values reported by each prior multi-class density estimator (with citations), compute the precise percentage reductions from those numbers, and add results from multiple independent training runs (different random seeds) reporting mean MAE together with standard deviations. We will also include paired statistical significance tests and the corresponding p-values for the main comparisons.

revision: yes

Referee: [§3.2] §3.2 (Category Focus Module): The assertion that the auxiliary segmentation loss improves the density head exclusively at training time (with no residual coupling via shared backbone features or loss weighting) is not supported by an ablation that evaluates the density head both with and without the module at inference; without this isolation, the comparison to YOLO11 in crowded samples and the claim of avoiding prior methods' inference constraints remain insecure.

Authors: The Category Focus Module applies an auxiliary segmentation loss only during training; at inference the module is completely removed and the density head operates without any segmentation output or extra constraints. To make this isolation explicit we will add an ablation (new Table or figure in §3.2 and §4) that compares density-estimation performance of models trained with versus without the Category Focus Module. Because the module is discarded after training, no inference-time ablation of the module itself is possible or necessary; the new training-only ablation directly quantifies the benefit to the density head while confirming zero inference overhead.

revision: yes

Referee: [§4.3] §4.3 (Comparison with detectors): The order-of-magnitude MAE improvement over YOLO11 in the most crowded test samples is presented without defining the density threshold used to select those samples or reporting the number of such samples; this detail is required to assess whether the result generalizes or is driven by a small subset.

Authors: We accept that the selection criterion must be stated explicitly. In the revised §4.3 we will define the density threshold (e.g., images whose object count exceeds the 90th percentile of the test-set distribution or a concrete per-pixel density value) used to isolate the most crowded samples and will report the exact number of test images satisfying the criterion. This information will allow readers to judge the scope of the reported improvement.

revision: yes
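
The percentile criterion mentioned in this response can be made concrete. The sketch below assumes the nearest-rank definition of a percentile and a strict "exceeds" comparison; both are assumptions, since the response does not fix either. The image names and counts are invented.

```python
def nearest_rank_percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def crowded_subset(counts_by_image, q=90):
    """Return the threshold and the images whose count strictly exceeds it."""
    threshold = nearest_rank_percentile(list(counts_by_image.values()), q)
    selected = sorted(
        name for name, count in counts_by_image.items() if count > threshold
    )
    return threshold, selected


# Hypothetical total object counts per test image.
test_counts = {
    "img_00": 3, "img_01": 8, "img_02": 12, "img_03": 15, "img_04": 20,
    "img_05": 22, "img_06": 40, "img_07": 55, "img_08": 120, "img_09": 310,
}
threshold, crowded = crowded_subset(test_counts, q=90)
print(f"threshold={threshold}, images above it={len(crowded)}: {crowded}")
```

In this toy set only one image clears the threshold, which is the situation the referee is worried about: an aggregate over such a subset rests on very few samples.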

Circularity Check

0 steps flagged

No significant circularity in empirical multi-class density estimation model

full rationale

The paper proposes an empirical architecture combining a Twins-SVT vision transformer backbone with a multiscale CNN decoder and a training-only auxiliary segmentation task via the Category Focus Module. Performance claims consist of reported MAE reductions (33/43/64%) on VisDrone and iSAID benchmarks plus comparisons to prior density estimators and YOLO11; these are direct experimental outcomes rather than derivations. No equations, first-principles results, or predictions appear that reduce to fitted inputs by construction. The auxiliary task is explicitly described as improving the density head at training time without inference constraints, presented as a design distinction from prior work rather than a self-referential reduction. Public code and weights further enable external verification. The derivation chain is self-contained as standard model design plus benchmark evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard density-map counting assumptions plus a newly introduced training module whose benefit is demonstrated only empirically on two benchmarks.

free parameters (1)

Training hyperparameters and loss weights
Typical deep learning model choices not detailed in abstract but required for reproduction.

axioms (1)

Domain assumption: density map summation accurately yields object counts even under heavy occlusion and class ambiguity.
Foundational premise stated in the abstract for why density estimation is used instead of detection.
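
The counting premise can be illustrated with a small construction: each annotated object contributes a blob of total mass one, so summing a class's density map recovers its count even where blobs overlap. This is a minimal sketch, assuming Gaussian kernels with a fixed `sigma` that are renormalised on the grid. The paper's actual ground-truth generation is not described in this page, and the point coordinates are invented.

```python
import math


def gaussian_density_map(points, height, width, sigma=1.5):
    """Place one unit-mass Gaussian per annotated (row, col) point on a grid."""
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        kernel = [
            [
                math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma**2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        mass = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                # Renormalise so each object adds exactly 1 to the map total.
                density[y][x] += kernel[y][x] / mass
    return density


def count_from_density(density):
    """The count estimate is the sum of all pixels in the density map."""
    return sum(map(sum, density))


# One density map per class; the two adjacent cars overlap heavily.
annotations = {"car": [(2, 3), (2, 4), (10, 12)], "pedestrian": [(5, 5)]}
counts = {
    name: count_from_density(gaussian_density_map(points, height=16, width=16))
    for name, points in annotations.items()
}
for name, value in counts.items():
    print(f"{name}: {value:.3f}")
```

The identity holds for the ground-truth maps by construction; whether a trained network's predicted maps sum to the right value is the empirical question the benchmarks address.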

invented entities (1)

Category Focus Module (no independent evidence)
purpose: Suppress inter-category interference during training only
New component introduced to improve the density head without inference-time constraints.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

[1] Tom D. Breeze, Alison P. Bailey, Kelvin G. Balcombe, Tom Brereton, Richard Comont, Mike Edwards, Michael P. Garratt, Martin Harvey, Cathy Hawes, Nick Isaac, Mark Jitlal, Catherine M. Jones, William E. Kunin, Paul Lee, Roger K. A. Morris, Andy Musgrove, Rory S. O’Connor, Jodey Peyton, Simon G. Potts, Stuart P. M. Roberts, David B. Roy, Helen E. Roy, Cuon... 2021.
[2] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[3] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34:9355–9366, 2021.
[4] Mingpeng Cui, Guanchen Ding, Daiqin Yang, and Zhenzhong Chen. Dopnet: Dense object prediction network for multiclass object counting and localization in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024.
[5] Li Dong, Haijun Zhang, Dongliang Zhou, Jianyang Shi, and Jianghong Ma. Cctwins: A weakly supervised transformer-based crowd counting method with adaptive scene consistency attention. IEEE Transactions on Consumer Electronics, 70(1):22–35, 2024.
[6] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[7] Qiyan Fu, Weidong Min, Weixiang Sheng, and Chunjiang Peng. Counting dense object of multiple types based on feature enhancement. Frontiers in Neurorobotics, 18:1383943.
[8] Guangshuai Gao, Qingjie Liu, and Yunhong Wang. Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method. IEEE Transactions on Geoscience and Remote Sensing, 59(5):3642–3655, 2021.
[9] Junyu Gao, Liangliang Zhao, and Xuelong Li. Nwpu-moc: A benchmark for fine-grained multicategory object counting in aerial images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–14, 2024.
[10] Adrian Salazar Gomez, Erchan Aptoula, Simon Parsons, and Petra Bosilj. Deep regression versus detection for counting in robotic phenotyping. IEEE Robotics and Automation Letters, 6(2):2902–2907, 2021.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[12] Damien Hicks, Mathilde Baude, Christoph Kratz, Pierre Ouvrard, and Graham Stone. Deep learning object detection to estimate the nectar sugar mass of flowering vegetation. Ecological Solutions and Evidence, 2(3):e12099, 2021.
[13] Glenn Jocher and Jing Qiu. Ultralytics yolo11, 2024.
[14] Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements, 2024.
[15] Falko Lavitt, Demi J. Rijlaarsdam, Dennet van der Linden, Ewelina Weglarz-Tomczak, and Jakub M. Tomczak. Deep learning and transfer learning for automatic cell counting in microscope images of human cancer cell lines. Applied Sciences, 11(11), 2021.
[16] Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou, and Xiang Bai. Transcrowd: weakly-supervised crowd counting with transformers. Science China Information Sciences, 65(6):160104, 2022.
[18] Hui Lin, Zhiheng Ma, Rongrong Ji, Yaowei Wang, Zhou Su, Xiaopeng Hong, and Deyu Meng. Semi-supervised counting via pixel-by-pixel density distribution modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3625–3638, 2025.
[19] Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[20] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[21] Andreas Michel, Wolfgang Gross, Fabian Schenkel, and Wolfgang Middelmann. Class-aware object counting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 469–478, 2022.
[22] Nuri Erkin Ocer, Gordana Kaplan, Firat Erdem, Dilek Kucuk Matci, and Ugur Avdan. Tree extraction from multi-scale uav images using mask r-cnn with fpn. Remote Sensing Letters, 11(9):847–856, 2020.
[23] Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10213–10224, 2021.
[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
[25] Jie Shen, Xin Xiong, Zhiyuan Xue, and Yinglong Bian. A convolutional neural-network-based pedestrian counting model for various crowded scenes. Computer-Aided Civil and Infrastructure Engineering, 34(10):897–914, 2019.
[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
[27] Ye Tian, Xiangxiang Chu, and Hongpeng Wang. Cctrans: Simplifying and improving crowd counting with transformer, 2021.
[28] UK Gov’t Department for Energy Security and Net Zero. Greenhouse gas reporting: conversion factors 2025, 2025. [Online; accessed 07-September-2025].
[29] Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 28–37, 2019.
[30] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] Wei Xu, Dingkang Liang, Yixiao Zheng, Jiahao Xie, and Zhanyu Ma. Dilated-scale-aware category-attention convnet for multi-class object counting. IEEE Signal Processing Letters, 28:1570–1574, 2021.
[32] Jiamao Yu and Hexuan Hu. Multiscale regional calibration network for crowd counting. Scientific Reports, 15(1):2866.
[33] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[34] Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2021.
[35] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257–276, 2023.

[1] [1]

Breeze, Alison P

Tom D. Breeze, Alison P. Bailey, Kelvin G. Balcombe, Tom Brereton, Richard Comont, Mike Edwards, Michael P. Gar- ratt, Martin Harvey, Cathy Hawes, Nick Isaac, Mark Jitlal, Catherine M. Jones, William E. Kunin, Paul Lee, Roger K. A. Morris, Andy Musgrove, Rory S. O’Connor, Jodey Peyton, Simon G. Potts, Stuart P. M. Roberts, David B. Roy, Helen E. Roy, Cuon...

work page 2021

[2] [2]

Scale aggregation network for accurate and efficient crowd count- ing

Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd count- ing. InProceedings of the European Conference on Com- puter Vision (ECCV), 2018. 2

work page 2018

[3] [3]

Twins: Revisiting the design of spatial attention in vision transformers.Advances in neural information processing systems, 34:9355–9366, 2021

Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haib- ing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers.Advances in neural information processing systems, 34:9355–9366, 2021. 2, 3, 4

work page 2021

[4] [4]

Dopnet: Dense object prediction network for multiclass object counting and localization in remote sens- ing images.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

Mingpeng Cui, Guanchen Ding, Daiqin Yang, and Zhen- zhong Chen. Dopnet: Dense object prediction network for multiclass object counting and localization in remote sens- ing images.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024. 2, 7, 8

work page 2024

[5] [5]

Cctwins: A weakly supervised transformer- based crowd counting method with adaptive scene consis- tency attention.IEEE Transactions on Consumer Electron- ics, 70(1):22–35, 2024

Li Dong, Haijun Zhang, Dongliang Zhou, Jianyang Shi, and Jianghong Ma. Cctwins: A weakly supervised transformer- based crowd counting method with adaptive scene consis- tency attention.IEEE Transactions on Consumer Electron- ics, 70(1):22–35, 2024. 1, 3, 4

work page 2024

[6] [6]

Centernet: Keypoint triplets for object detection

Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qing- ming Huang, and Qi Tian. Centernet: Keypoint triplets for object detection. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), 2019. 3

work page 2019

[7] [7]

Counting dense object of multiple types based on fea- ture enhancement.Frontiers in Neurorobotics, 18:1383943,

Qiyan Fu, Weidong Min, Weixiang Sheng, and Chunjiang Peng. Counting dense object of multiple types based on fea- ture enhancement.Frontiers in Neurorobotics, 18:1383943,

work page

[8] [8]

Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method.IEEE Transactions on Geoscience and Remote Sensing, 59(5):3642–3655, 2021

Guangshuai Gao, Qingjie Liu, and Yunhong Wang. Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method.IEEE Transactions on Geoscience and Remote Sensing, 59(5):3642–3655, 2021. 2

work page 2021

[9] [9]

Nwpu-moc: A benchmark for fine-grained multicategory object counting in aerial images.IEEE Transactions on Geoscience and Re- mote Sensing, 62:1–14, 2024

Junyu Gao, Liangliang Zhao, and Xuelong Li. Nwpu-moc: A benchmark for fine-grained multicategory object counting in aerial images.IEEE Transactions on Geoscience and Re- mote Sensing, 62:1–14, 2024. 2

work page 2024

[10] [10]

Deep regression versus detection for counting in robotic phenotyping.IEEE Robotics and Automation Letters, 6(2):2902–2907, 2021

Adrian Salazar Gomez, Erchan Aptoula, Simon Parsons, and Petra Bosilj. Deep regression versus detection for counting in robotic phenotyping.IEEE Robotics and Automation Letters, 6(2):2902–2907, 2021. 2

work page 2021

[11] [11]

Mask r-cnn

Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Gir- shick. Mask r-cnn. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2017. 3

work page 2017

[12] [12]

Deep learning object detection to estimate the nectar sugar mass of flowering vegetation.Eco- logical Solutions and Evidence, 2(3):e12099, 2021

Damien Hicks, Mathilde Baude, Christoph Kratz, Pierre Ou- vrard, and Graham Stone. Deep learning object detection to estimate the nectar sugar mass of flowering vegetation.Eco- logical Solutions and Evidence, 2(3):e12099, 2021. 1, 2, 4, 5, 7

work page 2021

[13] Glenn Jocher and Jing Qiu. Ultralytics YOLO11, 2024.

[14] Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements, 2024.

[15] Falko Lavitt, Demi J. Rijlaarsdam, Dennet van der Linden, Ewelina Weglarz-Tomczak, and Jakub M. Tomczak. Deep learning and transfer learning for automatic cell counting in microscope images of human cancer cell lines. Applied Sciences, 11(11), 2021.

[16] Yuhong Li, Xiaofan Zhang, and Deming Chen. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

[17] Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou, and Xiang Bai. TransCrowd: weakly-supervised crowd counting with transformers. Science China Information Sciences, 65(6):160104, 2022.

[18] Hui Lin, Zhiheng Ma, Rongrong Ji, Yaowei Wang, Zhou Su, Xiaopeng Hong, and Deyu Meng. Semi-supervised counting via pixel-by-pixel density distribution modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3625–3638, 2025.

[19] Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[20] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

[21] Andreas Michel, Wolfgang Gross, Fabian Schenkel, and Wolfgang Middelmann. Class-aware object counting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 469–478, 2022.

[22] Nuri Erkin Ocer, Gordana Kaplan, Firat Erdem, Dilek Kucuk Matci, and Ugur Avdan. Tree extraction from multi-scale UAV images using Mask R-CNN with FPN. Remote Sensing Letters, 11(9):847–856, 2020.

[23] Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10213–10224, 2021.

[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.

[25] Jie Shen, Xin Xiong, Zhiyuan Xue, and Yinglong Bian. A convolutional neural-network-based pedestrian counting model for various crowded scenes. Computer-Aided Civil and Infrastructure Engineering, 34(10):897–914, 2019.

[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.

[27] Ye Tian, Xiangxiang Chu, and Hongpeng Wang. CCTrans: Simplifying and improving crowd counting with transformer, 2021.

[28] UK Gov’t Department for Energy Security and Net Zero. Greenhouse gas reporting: conversion factors 2025, 2025. [Online; accessed 07-September-2025].

[29] Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. iSAID: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 28–37, 2019.

[30] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[31] Wei Xu, Dingkang Liang, Yixiao Zheng, Jiahao Xie, and Zhanyu Ma. Dilated-scale-aware category-attention convnet for multi-class object counting. IEEE Signal Processing Letters, 28:1570–1574, 2021.

[32] Jiamao Yu and Hexuan Hu. Multiscale regional calibration network for crowd counting. Scientific Reports, 15(1):2866,

[33] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[34] Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2021.

[35] Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257–276, 2023.