Getting the Numbers Right: Modelling Multi-Class Object Counting in Dense and Varied Scenes
Pith reviewed 2026-05-18 10:13 UTC · model grok-4.3
The pith
A vision transformer backbone with a training-only category focus module enables accurate multi-class object counting in both dense and sparse scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that three components together produce multi-class density maps that stay accurate across wide density variations: a Twins-SVT pyramid vision transformer backbone, a multiscale CNN decoder, and a Category Focus Module that adds an auxiliary segmentation task at training time only. The reported results are 33%, 43%, and 64% reductions in MAE relative to prior multi-class density estimators on the VisDrone and iSAID test sets, and an order-of-magnitude advantage over YOLO11 on the most crowded samples.
What carries the argument
The Category Focus Module carries the argument. It applies an auxiliary segmentation task during training to suppress inter-category interference in the density estimation head, and neither the task nor its constraints are used at inference time.
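The training-only idea can be illustrated with a toy sketch. This is not the paper's architecture or loss: the `ToyCounter` class, its fixed weights, the squared-error and cross-entropy terms, and the `aux_weight` parameter are all assumptions made up for illustration. The only point it shows is structural: the auxiliary head contributes to the training objective through shared features, while inference touches the density head alone.

```python
import math
from typing import List, Sequence


class ToyCounter:
    """Toy stand-in for a shared backbone with two heads (all weights are made up)."""

    def __init__(self) -> None:
        self.seg_head_calls = 0  # counts how often the auxiliary head is evaluated

    def features(self, pixels: Sequence[float]) -> List[float]:
        # Shared "backbone": a fixed scaling, purely illustrative.
        return [2.0 * p for p in pixels]

    def density_head(self, feats: Sequence[float]) -> List[float]:
        # Non-negative per-pixel density values.
        return [max(0.0, 0.5 * f) for f in feats]

    def seg_head(self, feats: Sequence[float]) -> List[float]:
        # Auxiliary per-pixel foreground probability, used during training only.
        self.seg_head_calls += 1
        return [1.0 / (1.0 + math.exp(-f)) for f in feats]


def training_loss(
    model: ToyCounter,
    pixels: Sequence[float],
    target_density: Sequence[float],
    target_mask: Sequence[int],
    aux_weight: float = 0.1,  # hypothetical loss weight
) -> float:
    """Density loss plus a weighted auxiliary segmentation loss."""
    feats = model.features(pixels)
    density = model.density_head(feats)
    mask_prob = model.seg_head(feats)
    density_loss = sum((d - t) ** 2 for d, t in zip(density, target_density)) / len(density)
    eps = 1e-7  # guards the logarithms against zero
    seg_loss = -sum(
        m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps)
        for p, m in zip(mask_prob, target_mask)
    ) / len(mask_prob)
    return density_loss + aux_weight * seg_loss


def predict_count(model: ToyCounter, pixels: Sequence[float]) -> float:
    """Inference path: only the density head is evaluated."""
    return sum(model.density_head(model.features(pixels)))


model = ToyCounter()
pixels = [0.0, 1.0, 0.5]
count = predict_count(model, pixels)  # the auxiliary head is not called here
loss = training_loss(model, pixels, [0.0, 1.0, 0.5], [0, 1, 1])
```

In a real network the auxiliary loss would still shape the shared backbone weights during training, which is the residual coupling the referee report asks the authors to isolate.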
If this is right
- Multi-class density estimation no longer needs to trade off performance between dense and sparse scenes.
- The method bridges the gap where prior density estimators degrade in low-density images and detectors like YOLO11 lose accuracy in high-density images.
- Practical counting systems can use a single model for varied real-world conditions without auxiliary tasks at test time.
- Error reductions of one-third to two-thirds suggest measurable gains in applications that rely on accurate class-specific counts.
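For readers who want to check what a figure such as "a 33% reduction in MAE" means arithmetically, the following is a minimal sketch. The function names and every number in it are hypothetical and are not taken from the paper; it only shows how MAE over per-image counts and a relative reduction between two MAE values would typically be computed, along with a mean and standard deviation over repeated runs.

```python
from statistics import mean, stdev
from typing import Sequence


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth counts."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def relative_reduction(baseline_mae: float, new_mae: float) -> float:
    """Percentage by which new_mae is lower than baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


# Hypothetical per-image counts, for illustration only.
truth = [10, 50, 200]
baseline_pred = [16, 41, 185]  # absolute errors 6, 9, 15
new_pred = [14, 44, 192]       # absolute errors 4, 6, 8

baseline_mae = mae(baseline_pred, truth)
new_mae = mae(new_pred, truth)
reduction = relative_reduction(baseline_mae, new_mae)

# Hypothetical MAE values from three training seeds of the same model.
seed_maes = [6.0, 6.5, 5.5]
seed_mean, seed_std = mean(seed_maes), stdev(seed_maes)
```

With these made-up inputs the baseline MAE is 10, the new MAE is 6, and the relative reduction is 40%. Reporting the absolute MAE values next to the percentage lets a reader redo this calculation.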
Where Pith is reading between the lines
- The non-exclusive class modeling could transfer to other vision tasks that must handle overlapping or ambiguous categories.
- Combining the approach with temporal information from video could extend robust counting to dynamic scenes.
- Further scaling the transformer backbone might yield additional gains on larger or more varied datasets.
Load-bearing premise
The auxiliary segmentation task can improve the density estimation head during training without introducing biases that would require applying the same task or its assumptions at inference time.
What would settle it
Evaluating the model on a held-out dataset containing object categories or density extremes absent from VisDrone and iSAID would show whether the reported error reductions and cross-density robustness persist.
Original abstract
Density map estimation enables accurate object counting in heavily occluded and densely packed scenes where detection-based counting fails. In multi-class density estimation, class awareness can be introduced by modelling classes non-exclusively, better reflecting crowded and visually ambiguous contexts. However, existing multi-class density estimators often degrade in less-dense scenes, while state-of-the-art detectors still struggle in the most congested settings. To bridge this gap, we propose the first vision-transformer-based approach to multi-class density estimation. Our model combines a Twins-SVT pyramid vision transformer backbone with a multiscale CNN decoder that leverages hierarchical features for robust counting across a wide range of densities. In addition, the method adds an auxiliary segmentation task with the Category Focus Module to suppress inter-category interference at training time. The module improves the density estimation head without the constraining assumptions that previous methods incur by applying the auxiliary task at inference time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate a leap in performance versus the previous state-of-the-art multi-class density estimation methods, attaining 33%, 43%, and 64% reductions in MAE in testing evaluation. The method outperforms YOLO11 in less busy scenes, exceeding it by an order of magnitude in the most crowded testing samples. Code and trained weights are available at https://github.com/LCAS/gnr_mcdest.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the first vision-transformer-based multi-class density estimation approach for object counting. It combines a Twins-SVT pyramid backbone with a multiscale CNN decoder and introduces a Category Focus Module that applies an auxiliary segmentation task solely during training to suppress inter-category interference. The method is evaluated on VisDrone and iSAID, claiming 33%, 43%, and 64% MAE reductions versus prior multi-class density estimators, plus outperformance of YOLO11 (by an order of magnitude in the densest samples) without imposing inference-time constraints from the auxiliary task. Public code and weights are released.
Significance. If the central performance claims hold after verification of the training-only decoupling, the work would meaningfully bridge density estimation (strong in dense/occluded scenes) and detection (strong in sparse scenes) for multi-class counting across density regimes. The public code release and trained weights are a clear strength supporting reproducibility and follow-up work.
major comments (3)
- [§4] §4 (Experiments) and Table 2: The 33/43/64% MAE reductions versus prior multi-class density estimators are reported without standard deviations across runs, without p-values, and without an explicit list of the exact prior methods and their reported numbers; this makes the 'leap in performance' claim difficult to verify as load-bearing for the central contribution.
- [§3.2] §3.2 (Category Focus Module): The assertion that the auxiliary segmentation loss improves the density head exclusively at training time (with no residual coupling via shared backbone features or loss weighting) is not supported by an ablation that evaluates the density head both with and without the module at inference; without this isolation, the comparison to YOLO11 in crowded samples and the claim of avoiding prior methods' inference constraints remain insecure.
- [§4.3] §4.3 (Comparison with detectors): The order-of-magnitude MAE improvement over YOLO11 in the most crowded test samples is presented without defining the density threshold used to select those samples or reporting the number of such samples; this detail is required to assess whether the result generalizes or is driven by a small subset.
minor comments (3)
- [Abstract] The abstract states '33%, 43%, and 64% reduction to MAE' but does not name the three prior methods or the corresponding absolute MAE values; adding these would improve clarity.
- [Figure 3] Figure 3 (architecture diagram) would benefit from an explicit inference-time path annotation showing that the Category Focus Module is removed, to visually support the decoupling claim.
- [§3.1] The Twins-SVT backbone citation is present but the exact variant (e.g., Twins-SVT-B) and input resolution used should be stated in §3.1 for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve clarity and verifiability of the results.
- Referee: [§4] §4 (Experiments) and Table 2: The 33/43/64% MAE reductions versus prior multi-class density estimators are reported without standard deviations across runs, without p-values, and without an explicit list of the exact prior methods and their reported numbers; this makes the 'leap in performance' claim difficult to verify as load-bearing for the central contribution.
  Authors: We agree that additional statistical detail would strengthen the presentation. In the revised manuscript we will expand Table 2 to list the exact MAE values reported by each prior multi-class density estimator (with citations), compute the precise percentage reductions from those numbers, and add results from multiple independent training runs (different random seeds) reporting mean MAE together with standard deviations. We will also include paired statistical significance tests and the corresponding p-values for the main comparisons.
  Revision: yes
- Referee: [§3.2] §3.2 (Category Focus Module): The assertion that the auxiliary segmentation loss improves the density head exclusively at training time (with no residual coupling via shared backbone features or loss weighting) is not supported by an ablation that evaluates the density head both with and without the module at inference; without this isolation, the comparison to YOLO11 in crowded samples and the claim of avoiding prior methods' inference constraints remain insecure.
  Authors: The Category Focus Module applies an auxiliary segmentation loss only during training; at inference the module is completely removed and the density head operates without any segmentation output or extra constraints. To make this isolation explicit we will add an ablation (new Table or figure in §3.2 and §4) that compares density-estimation performance of models trained with versus without the Category Focus Module. Because the module is discarded after training, no inference-time ablation of the module itself is possible or necessary; the new training-only ablation directly quantifies the benefit to the density head while confirming zero inference overhead.
  Revision: yes
- Referee: [§4.3] §4.3 (Comparison with detectors): The order-of-magnitude MAE improvement over YOLO11 in the most crowded test samples is presented without defining the density threshold used to select those samples or reporting the number of such samples; this detail is required to assess whether the result generalizes or is driven by a small subset.
  Authors: We accept that the selection criterion must be stated explicitly. In the revised §4.3 we will define the density threshold (e.g., images whose object count exceeds the 90th percentile of the test-set distribution or a concrete per-pixel density value) used to isolate the most crowded samples and will report the exact number of test images satisfying the criterion. This information will allow readers to judge the scope of the reported improvement.
  Revision: yes
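The selection rule mentioned in the last response (images whose object count exceeds the 90th percentile of the test-set distribution) can be sketched as follows. This is one possible reading, not the authors' procedure: the nearest-rank percentile definition, the function names, and the image counts are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values or not 0 < q <= 100:
        raise ValueError("need a non-empty list and 0 < q <= 100")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def crowded_subset(counts: Dict[str, int], q: float = 90.0) -> Tuple[float, List[str]]:
    """Return the threshold and the images whose count strictly exceeds it."""
    threshold = percentile(list(counts.values()), q)
    selected = sorted(name for name, c in counts.items() if c > threshold)
    return threshold, selected


# Hypothetical ground-truth object counts per test image.
test_counts = {
    "img00": 5, "img01": 12, "img02": 20, "img03": 33, "img04": 47,
    "img05": 60, "img06": 85, "img07": 140, "img08": 310, "img09": 920,
}

threshold, crowded = crowded_subset(test_counts)
subset_size = len(crowded)
```

On this toy set the threshold is 310 and a single image qualifies, which illustrates the referee's concern: a headline result on "the most crowded samples" can rest on very few images, so both the threshold and the subset size need to be reported.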
Circularity Check
No significant circularity in empirical multi-class density estimation model
The paper proposes an empirical architecture combining a Twins-SVT vision transformer backbone with a multiscale CNN decoder and a training-only auxiliary segmentation task via the Category Focus Module. Performance claims consist of reported MAE reductions (33/43/64%) on VisDrone and iSAID benchmarks plus comparisons to prior density estimators and YOLO11; these are direct experimental outcomes rather than derivations. No equations, first-principles results, or predictions appear that reduce to fitted inputs by construction. The auxiliary task is explicitly described as improving the density head at training time without inference constraints, presented as a design distinction from prior work rather than a self-referential reduction. Public code and weights further enable external verification. The derivation chain is self-contained as standard model design plus benchmark evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training hyperparameters and loss weights
axioms (1)
- domain assumption: Density map summation accurately yields object counts even under heavy occlusion and class ambiguity
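The assumption above is the standard way counts are read off density maps: each class has its own map, and the predicted count for that class is the sum of the map's values. A minimal sketch, with made-up 2x2 maps and a hypothetical function name, looks like this:

```python
from typing import Dict, List

DensityMap = List[List[float]]  # rows of per-pixel density values


def counts_from_density(maps: Dict[str, DensityMap]) -> Dict[str, float]:
    """Sum each class's density map to obtain its estimated object count."""
    return {name: sum(sum(row) for row in grid) for name, grid in maps.items()}


# Hypothetical maps for two classes over the same 2x2 image.
# Classes are modelled non-exclusively, so one pixel may carry
# density for more than one class.
example_maps = {
    "car": [[0.5, 0.5], [0.25, 0.75]],
    "pedestrian": [[0.0, 1.0], [0.0, 0.0]],
}

estimated = counts_from_density(example_maps)
```

Here the estimated counts are 2.0 cars and 1.0 pedestrian. The assumption being ledgered is that such sums stay close to the true counts when objects occlude each other or classes are visually ambiguous.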
invented entities (1)
- Category Focus Module: no independent evidence