Dense Scale Network for Crowd Counting
Pith reviewed 2026-05-25 18:02 UTC · model grok-4.3
The pith
Dense dilated convolution blocks with residual links capture many scales in a single network for crowd counting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a single-column architecture built from three cascaded dense dilated convolution blocks, linked by dense residual connections and trained with a consistency loss, can represent a wider and more continuous range of scales than previous multi-column or multi-branch networks. The paper reports the lowest counting error among the compared methods on ShanghaiTech, UCF-QNRF, UCF_CC_50 and UCSD.
What carries the argument
The dense dilated convolution block, in which each dilation layer is densely connected to the others to keep information from many scales while selected rates prevent gridding artifacts.
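One way to see why dense links matter for scale coverage is to count the receptive-field sizes that reach the block output. The sketch below is an illustration, not the paper's implementation. It assumes 3-tap kernels, stride 1, the rates 1, 2, 3 quoted elsewhere on this page, and that dense connectivity lets the output see every ordered subset of layers, including the bare input. The function names are hypothetical.

```python
from itertools import combinations


def dense_path_receptive_fields(rates, kernel=3):
    """Receptive-field sizes reachable through any subset of dilated layers.

    Assumes stride 1. A layer with dilation d and a k-tap kernel widens
    the receptive field by (k - 1) * d; the empty subset is the skip path.
    """
    sizes = set()
    for n in range(len(rates) + 1):
        for subset in combinations(rates, n):
            sizes.add(1 + (kernel - 1) * sum(subset))
    return sorted(sizes)


def plain_stack_receptive_fields(rates, kernel=3):
    """Receptive-field size after each layer of a purely sequential stack."""
    sizes, rf = [], 1
    for d in rates:
        rf += (kernel - 1) * d
        sizes.append(rf)
    return sizes


print(dense_path_receptive_fields((1, 2, 3)))   # [1, 3, 5, 7, 9, 11, 13]
print(plain_stack_receptive_fields((1, 2, 3)))  # [3, 7, 13]
```

Under these assumptions the densely connected block exposes seven evenly spaced receptive-field sizes, while the same layers stacked without dense links expose three.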
If this is right
- DSNet reports the highest accuracy among compared methods on all four datasets.
- Relative error reductions reach 30 percent on UCF-QNRF and UCF_CC_50 and 20 percent on the remaining two sets.
- The network trains end-to-end without column- or branch-specific hyper-parameters.
- The added multi-scale density level consistency loss contributes to the observed gains.
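The consistency loss in the last bullet can be sketched as follows. This page only names the loss and the three pooled levels (1×1, 2×2 and 4×4); the L1 distance, the division by the number of cells per level, and the function names are assumptions made for illustration, and any weighting against the main density loss is left out.

```python
import math


def adaptive_avg_pool(density, k):
    """Average-pool a 2-D map (list of rows) down to a k x k grid."""
    h, w = len(density), len(density[0])
    pooled = []
    for i in range(k):
        r0, r1 = (i * h) // k, math.ceil((i + 1) * h / k)
        row = []
        for j in range(k):
            c0, c1 = (j * w) // k, math.ceil((j + 1) * w / k)
            cells = [density[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def consistency_loss(pred, target, levels=(1, 2, 4)):
    """Sum over levels of the mean absolute gap between pooled maps."""
    total = 0.0
    for k in levels:
        p, t = adaptive_avg_pool(pred, k), adaptive_avg_pool(target, k)
        gap = sum(abs(p[i][j] - t[i][j]) for i in range(k) for j in range(k))
        total += gap / (k * k)
    return total


# Hypothetical 4x4 density maps with the same total count (4.0).
target = [[0.25] * 4 for _ in range(4)]
pred = [[0.5, 0.5, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.25] * 4,
        [0.25] * 4]
print(consistency_loss(pred, target))  # 0.25
```

The two maps agree at the 1×1 level, so the penalty comes entirely from the 2×2 and 4×4 levels, where the predicted mass sits in the wrong region.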
Where Pith is reading between the lines
- The same block structure could be inserted into other dense-prediction tasks that suffer from scale variation, such as semantic segmentation.
- If the block works as described, future counting work may need fewer hand-designed multi-branch modules.
- Applying the network to video sequences could test whether the scale coverage also improves temporal consistency.
Load-bearing premise
The specific dilation rates inside each block plus the dense residual links across blocks will cover the needed scales without gridding artifacts or extra per-dataset tuning.
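A quick way to probe this premise is to enumerate which input positions can influence a single output of a stack of dilated convolutions. The sketch below works in one dimension and assumes odd kernels and stride 1. The rates (1, 2, 3) are the ones quoted elsewhere on this page; (2, 2, 2) is a hypothetical counter-example, and the function names are illustrative.

```python
def footprint_offsets(rates, kernel=3):
    """Input offsets (1-D) that can reach one output of stacked dilated convs."""
    half = kernel // 2
    offsets = {0}
    for d in rates:
        offsets = {o + t * d for o in offsets for t in range(-half, half + 1)}
    return offsets


def holes(rates, kernel=3):
    """Offsets inside the footprint's span that no path touches."""
    reached = footprint_offsets(rates, kernel)
    return [o for o in range(min(reached), max(reached) + 1) if o not in reached]


print(holes((1, 2, 3)))  # []
print(holes((2, 2, 2)))  # [-5, -3, -1, 1, 3, 5]
```

An empty list means every position inside the span contributes, so there is no gridding in this simplified sense. The repeated rate 2 skips every odd offset, which is the checkerboard pattern the premise is meant to rule out.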
What would settle it
Measure counting error on a new crowd dataset whose scale distribution lies clearly outside the ranges seen in the four evaluated sets; if the error there does not stay below that of prior state-of-the-art methods, the central claim is weakened.
Original abstract
Crowd counting has been widely studied by computer vision community in recent years. Due to the large scale variation, it remains to be a challenging task. Previous methods adopt either multi-column CNN or single-column CNN with multiple branches to deal with this problem. However, restricted by the number of columns or branches, these methods can only capture a few different scales and have limited capability. In this paper, we propose a simple but effective network called DSNet for crowd counting, which can be easily trained in an end-to-end fashion. The key component of our network is the dense dilated convolution block, in which each dilation layer is densely connected with the others to preserve information from continuously varied scales. The dilation rates in dilation layers are carefully selected to prevent the block from gridding artifacts. To further enlarge the range of scales covered by the network, we cascade three blocks and link them with dense residual connections. We also introduce a novel multi-scale density level consistency loss for performance improvement. To evaluate our method, we compare it with state-of-the-art algorithms on four crowd counting datasets (ShanghaiTech, UCF-QNRF, UCF_CC_50 and UCSD). Experimental results demonstrate that DSNet can achieve the best performance and make significant improvements on all the four datasets (30% on the UCF-QNRF and UCF_CC_50, and 20% on the others).
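The percentage improvements quoted in the abstract are presumably relative error reductions against an earlier method. The abstract does not say which baseline or which metric, so the sketch below only shows the arithmetic. The function name and the numbers are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_error, new_error):
    """Fractional drop in error relative to the baseline (0.30 means 30%)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical MAE values, not taken from the paper.
print(round(relative_reduction(100.0, 70.0), 2))  # 0.3
```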
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DSNet, a single-column CNN for crowd counting that employs dense dilated convolution blocks (with dilation rates selected to avoid gridding) densely connected across layers, cascaded in three blocks with dense residual links, plus a multi-scale density level consistency loss. It claims state-of-the-art results on ShanghaiTech, UCF-QNRF, UCF_CC_50 and UCSD, with 20-30% gains over prior methods.
Significance. If the empirical claims are reproducible and the gains are shown to stem from the architecture rather than tuning, the work would be significant for demonstrating that dense residual connections among dilated convolutions can capture continuous scale ranges in a compact network, offering a simpler alternative to multi-column or multi-branch designs for scale variation in crowd counting.
major comments (3)
- [Method] Method section (dense dilated convolution block description): the central architectural claim attributes performance gains to 'carefully selected' dilation rates that prevent gridding while covering continuously varied scales, yet no explicit rates, selection rule, or derivation is supplied; without this, it is impossible to assess whether the rates generalize or are tuned to the scale statistics of the four evaluation datasets.
- [Experiments] Experiments section: no ablation results are reported on the dilation rates, the choice of three cascaded blocks, the dense residual connections, or the loss weighting coefficients, so the 30% improvement on UCF-QNRF and UCF_CC_50 cannot be confidently attributed to the proposed dense connections rather than hyper-parameter search.
- [Abstract] Abstract and experimental claims: performance numbers are stated without reference to the evaluation protocol, baseline implementations, error bars, or statistical tests, which is load-bearing for the headline claim of 'best performance' and 'significant improvements'.
minor comments (1)
- [Abstract] The abstract does not specify the evaluation metrics (MAE/MSE) used to report the 20-30% gains.
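For reference, the two metrics named in the minor comment are usually computed per image on predicted and true counts. In crowd-counting papers the figure labelled MSE is commonly the root of the mean squared error; whether this paper follows that convention is an assumption here. The sample counts are hypothetical.

```python
import math


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)


def rmse(pred_counts, true_counts):
    """Root of the mean squared count error (often reported as 'MSE')."""
    squared = sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts))
    return math.sqrt(squared / len(true_counts))


# Hypothetical counts for three test images.
pred = [105.0, 48.0, 210.0]
true = [100.0, 50.0, 200.0]
print(mae(pred, true), rmse(pred, true))
```

Because the second metric squares each error before averaging, it penalizes a few badly miscounted images more heavily than MAE does.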
Simulated Author's Rebuttal
Thank you for the constructive review. We address each major comment point by point below and indicate planned revisions.
- Referee: [Method] Method section (dense dilated convolution block description): the central architectural claim attributes performance gains to 'carefully selected' dilation rates that prevent gridding while covering continuously varied scales, yet no explicit rates, selection rule, or derivation is supplied; without this, it is impossible to assess whether the rates generalize or are tuned to the scale statistics of the four evaluation datasets.
  Authors: We agree the rates and rule must be explicit. Rates were chosen (1,2,3; 1,2,4; 1,2,5 across the three blocks) to avoid gridding by ensuring no shared factors between successive rates while spanning continuous scales; the rule derives from the gridding condition in dilated convolution literature. We will add the exact rates, table, and derivation to the method section.
  Revision: yes
- Referee: [Experiments] Experiments section: no ablation results are reported on the dilation rates, the choice of three cascaded blocks, the dense residual connections, or the loss weighting coefficients, so the 30% improvement on UCF-QNRF and UCF_CC_50 cannot be confidently attributed to the proposed dense connections rather than hyper-parameter search.
  Authors: The referee correctly notes that ablations are required for attribution. The original submission emphasized end-to-end SOTA comparisons; we will add ablations on dilation rates, block count, dense residuals, and loss weights in the revised experiments.
  Revision: yes
- Referee: [Abstract] Abstract and experimental claims: performance numbers are stated without reference to the evaluation protocol, baseline implementations, error bars, or statistical tests, which is load-bearing for the headline claim of 'best performance' and 'significant improvements'.
  Authors: We will revise the abstract and experiments to state that MAE/MSE follow the standard test splits defined in each dataset paper, baselines are from the cited original works, and note single-run results (standard in the field). Error bars and tests were not computed originally.
  Revision: partial
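The error bars discussed in the last exchange could be produced by repeating training with different random seeds and reporting the mean and sample standard deviation of the test error. The sketch below shows only that bookkeeping. The function name and the per-run MAE values are hypothetical, and nothing here reflects results from the paper.

```python
import statistics


def summarize_runs(errors):
    """Return (mean, sample standard deviation) of per-run test errors."""
    mean = statistics.mean(errors)
    std = statistics.stdev(errors) if len(errors) > 1 else 0.0
    return mean, std


# Hypothetical MAE from three training runs with different seeds.
runs = [62.0, 60.0, 64.0]
mean, std = summarize_runs(runs)
print(f"MAE = {mean:.1f} +/- {std:.1f}")  # MAE = 62.0 +/- 2.0
```

A gap between two methods that is smaller than this spread would be hard to call a significant improvement from a single run.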
Circularity Check
No circularity in empirical architecture proposal
full rationale
The paper proposes DSNet as an empirical CNN architecture with dense dilated blocks, selected dilation rates, residual connections, and a consistency loss, then reports experimental results on public datasets. No equations, derivations, or first-principles predictions exist that could reduce to inputs by construction. No self-citation chains, fitted parameters renamed as predictions, or uniqueness theorems are invoked. The design choices are presented as engineering decisions validated by ablation and comparison, making the work self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- dilation rates
- loss weighting coefficients
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (D=3 forcing)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "The dilation rates in dilation layers are carefully selected to prevent the block from gridding artifacts... we cascade three blocks and link them with dense residual connections."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat (8-tick / period-8 structure)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "dense dilated convolution block... dilation rate of 1, 2, 3... three scale levels... output size of 1×1, 2×2 and 4×4"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.