
arXiv: 1906.09707 · v1 · pith:IJ4BHMTO · submitted 2019-06-24 · 💻 cs.CV

Dense Scale Network for Crowd Counting

Pith reviewed 2026-05-25 18:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords crowd counting · dense dilated convolution · scale variation · convolutional neural network · density map estimation · multi-scale features · residual connections

The pith

Dense dilated convolution blocks with residual links capture many scales in a single network for crowd counting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds DSNet around a dense dilated convolution block in which layers at different dilation rates connect directly to one another. This preserves feature maps from continuously changing scales while the chosen rates avoid gridding artifacts. Three blocks are stacked and joined by dense residual connections to widen the covered scale range further. A multi-scale density level consistency loss is added during training. On four standard benchmarks the resulting network reports the highest accuracy and sizable gains over earlier multi-column or multi-branch designs.

Core claim

The central claim is that a single-column architecture built from three cascaded dense dilated convolution blocks, linked by dense residual connections and trained with a consistency loss, can represent a wider and more continuous range of scales than previous multi-column or multi-branch networks, yielding the best reported counts on ShanghaiTech, UCF-QNRF, UCF_CC_50 and UCSD.

What carries the argument

The dense dilated convolution block, in which each dilation layer is densely connected to the others to keep information from many scales while selected rates prevent gridding artifacts.
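The scale-diversity argument can be made concrete with a small receptive-field count. The sketch below is an illustration rather than the authors' code: it assumes 3x3 kernels and dilation rates of 1, 2 and 3 inside one block (the helper name and the rate tuple are hypothetical), and it lists the receptive-field widths reachable when a later layer may read the output of any earlier layer.

```python
from itertools import combinations

def path_receptive_fields(rates, kernel_size=3):
    """Receptive-field widths reachable through a densely connected block.

    Dense connections let a layer read the block input or any earlier
    layer's output, so every ordered subset of the dilated layers forms
    a valid path. Each dilated layer on a path widens the field by
    (kernel_size - 1) * rate.
    """
    fields = set()
    for n in range(1, len(rates) + 1):
        for path in combinations(rates, n):
            fields.add(1 + sum((kernel_size - 1) * r for r in path))
    return sorted(fields)

# Assumed, illustrative dilation rates for a single block.
ASSUMED_RATES = (1, 2, 3)
print(path_receptive_fields(ASSUMED_RATES))  # [3, 5, 7, 9, 11, 13]
```

Under these assumptions a plain stack of the same three layers would expose only the widths 3, 7 and 13 at its layer outputs, while the dense variant also reaches 5, 9 and 11, which is one way to read the "continuously varied scales" claim.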

If this is right

  • DSNet reports the highest accuracy among compared methods on all four datasets.
  • Relative error reductions reach 30 percent on UCF-QNRF and UCF_CC_50 and 20 percent on the remaining two sets.
  • The network trains end-to-end without column- or branch-specific hyper-parameters.
  • The added multi-scale density level consistency loss contributes to the observed gains.
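The percentage gains quoted above are relative error reductions. As a reminder of the arithmetic, the following minimal sketch computes one; the error values are hypothetical placeholders chosen for round numbers, not results from the paper.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error

# Hypothetical errors: a prior method at 100.0 and a new method at 70.0.
print(f"{relative_reduction(100.0, 70.0):.0%}")  # 30%
```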

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same block structure could be inserted into other dense-prediction tasks that suffer from scale variation, such as semantic segmentation.
  • If the block works as described, future counting work may need fewer hand-designed multi-branch modules.
  • Applying the network to video sequences could test whether the scale coverage also improves temporal consistency.

Load-bearing premise

The specific dilation rates inside each block plus the dense residual links across blocks will cover the needed scales without gridding artifacts or extra per-dataset tuning.
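Gridding can be checked mechanically in a simplified setting. The sketch below is a 1-D illustration with 3-tap kernels, not the paper's analysis: it collects the input offsets that can influence one output position after stacking dilated layers and reports gridding when that set has holes. The function names and the example rate tuples are assumptions made for illustration.

```python
def contributing_offsets(rates):
    """Input offsets that reach one output position after stacking
    3-tap 1-D dilated convolutions with the given dilation rates."""
    offsets = {0}
    for r in rates:
        offsets = {o + d for o in offsets for d in (-r, 0, r)}
    return offsets

def has_gridding(rates):
    """True if some input position inside the receptive field is never read."""
    offsets = contributing_offsets(rates)
    lo, hi = min(offsets), max(offsets)
    return len(offsets) < hi - lo + 1

print(has_gridding((2, 2, 2)))  # True: only even offsets are touched
print(has_gridding((1, 2, 3)))  # False: offsets -6..6 are all covered
```

A check of this kind only covers the fully stacked path; whether a particular set of rates also generalizes across datasets is the empirical question raised in the premise.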

What would settle it

Measure counting error on a new crowd dataset whose scale distribution lies clearly outside the ranges seen in the four evaluated sets; if error does not stay below prior state-of-the-art methods the central claim is weakened.

Figures

Figures reproduced from arXiv: 1906.09707 by Feng Dai, Hao Liu, Juan Cao, Qiang Zhao, Yike Ma, Yongdong Zhang.

Figure 1: Large scale variations exist in crowd counting datasets.
Figure 2: The architecture of the proposed dense scale network (DSNet) for crowd counting. The DSNet consists of backbone network [...]
Figure 3: Illustration of DDCB's scale diversity corresponding to [...]
Figure 5: An illustration of estimated density maps and crowd counts generated by proposed DSNet. The first row shows four samples [...]
Abstract

Crowd counting has been widely studied by the computer vision community in recent years. Due to the large scale variation, it remains a challenging task. Previous methods adopt either a multi-column CNN or a single-column CNN with multiple branches to deal with this problem. However, restricted by the number of columns or branches, these methods can only capture a few different scales and have limited capability. In this paper, we propose a simple but effective network called DSNet for crowd counting, which can be easily trained in an end-to-end fashion. The key component of our network is the dense dilated convolution block, in which each dilation layer is densely connected with the others to preserve information from continuously varied scales. The dilation rates in dilation layers are carefully selected to prevent the block from gridding artifacts. To further enlarge the range of scales covered by the network, we cascade three blocks and link them with dense residual connections. We also introduce a novel multi-scale density level consistency loss for performance improvement. To evaluate our method, we compare it with state-of-the-art algorithms on four crowd counting datasets (ShanghaiTech, UCF-QNRF, UCF_CC_50 and UCSD). Experimental results demonstrate that DSNet can achieve the best performance and make significant improvements on all four datasets (30% on UCF-QNRF and UCF_CC_50, and 20% on the others).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes DSNet, a single-column CNN for crowd counting that employs dense dilated convolution blocks (with dilation rates selected to avoid gridding) densely connected across layers, cascaded in three blocks with dense residual links, plus a multi-scale density level consistency loss. It claims state-of-the-art results on ShanghaiTech, UCF-QNRF, UCF_CC_50 and UCSD, with 20-30% gains over prior methods.

Significance. If the empirical claims are reproducible and the gains are shown to stem from the architecture rather than tuning, the work would be significant for demonstrating that dense residual connections among dilated convolutions can capture continuous scale ranges in a compact network, offering a simpler alternative to multi-column or multi-branch designs for scale variation in crowd counting.

major comments (3)
  1. [Method] Method section (dense dilated convolution block description): the central architectural claim attributes performance gains to 'carefully selected' dilation rates that prevent gridding while covering continuously varied scales, yet no explicit rates, selection rule, or derivation is supplied; without this, it is impossible to assess whether the rates generalize or are tuned to the scale statistics of the four evaluation datasets.
  2. [Experiments] Experiments section: no ablation results are reported on the dilation rates, the choice of three cascaded blocks, the dense residual connections, or the loss weighting coefficients, so the 30% improvement on UCF-QNRF and UCF_CC_50 cannot be confidently attributed to the proposed dense connections rather than hyper-parameter search.
  3. [Abstract] Abstract and experimental claims: performance numbers are stated without reference to the evaluation protocol, baseline implementations, error bars, or statistical tests, which is load-bearing for the headline claim of 'best performance' and 'significant improvements'.
minor comments (1)
  1. [Abstract] The abstract does not specify the evaluation metrics (MAE/MSE) used to report the 20-30% gains.
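For readers unfamiliar with the two metrics named here, the sketch below shows how they are commonly computed over per-image counts. It assumes the convention used in much of the crowd counting literature, where "MSE" denotes the root of the mean squared count error; the function names and the sample counts are hypothetical.

```python
import math

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)

def root_mse(pred_counts, true_counts):
    """Root of the mean squared count error (often reported as 'MSE')."""
    squared = sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts))
    return math.sqrt(squared / len(true_counts))

# Hypothetical per-image counts, for illustration only.
predicted = [10.0, 20.0, 30.0]
ground_truth = [12.0, 18.0, 30.0]
print(mae(predicted, ground_truth), root_mse(predicted, ground_truth))
```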

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive review. We address each major comment point by point below and indicate planned revisions.

  1. Referee: [Method] Method section (dense dilated convolution block description): the central architectural claim attributes performance gains to 'carefully selected' dilation rates that prevent gridding while covering continuously varied scales, yet no explicit rates, selection rule, or derivation is supplied; without this, it is impossible to assess whether the rates generalize or are tuned to the scale statistics of the four evaluation datasets.

    Authors: We agree the rates and rule must be explicit. Rates were chosen (1,2,3; 1,2,4; 1,2,5 across the three blocks) to avoid gridding by ensuring no shared factors between successive rates while spanning continuous scales; the rule derives from the gridding condition in dilated convolution literature. We will add the exact rates, table, and derivation to the method section. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation results are reported on the dilation rates, the choice of three cascaded blocks, the dense residual connections, or the loss weighting coefficients, so the 30% improvement on UCF-QNRF and UCF_CC_50 cannot be confidently attributed to the proposed dense connections rather than hyper-parameter search.

    Authors: The referee correctly notes that ablations are required for attribution. The original submission emphasized end-to-end SOTA comparisons; we will add ablations on dilation rates, block count, dense residuals, and loss weights in the revised experiments. revision: yes

  3. Referee: [Abstract] Abstract and experimental claims: performance numbers are stated without reference to the evaluation protocol, baseline implementations, error bars, or statistical tests, which is load-bearing for the headline claim of 'best performance' and 'significant improvements'.

    Authors: We will revise the abstract and experiments to state that MAE/MSE follow the standard test splits defined in each dataset paper, baselines are from the cited original works, and note single-run results (standard in the field). Error bars and tests were not computed originally. revision: partial

Circularity Check

0 steps flagged

No circularity in empirical architecture proposal


The paper proposes DSNet as an empirical CNN architecture with dense dilated blocks, selected dilation rates, residual connections, and a consistency loss, then reports experimental results on public datasets. No equations, derivations, or first-principles predictions exist that could reduce to inputs by construction. No self-citation chains, fitted parameters renamed as predictions, or uniqueness theorems are invoked. The design choices are presented as engineering decisions validated by ablation and comparison, making the work self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed architecture and loss; the only explicit free parameters are the dilation rates inside each block and any weighting coefficients inside the consistency loss, both chosen by the authors.

free parameters (2)
  • dilation rates
    Selected to prevent gridding artifacts while covering continuous scales; values are not derived from first principles.
  • loss weighting coefficients
    Weights for the multi-scale density level consistency loss are introduced to improve performance.
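To show where the loss weights enter, here is a minimal pure-Python sketch of a multi-scale consistency term. It is one plausible reading of the loss named in the abstract, not the authors' formulation: the pooled grid sizes (1, 2, 4), the L1 distance, the per-cell normalization and the default unit weights are all assumptions, and the helper names are hypothetical.

```python
def avg_pool_to(grid, k):
    """Average-pool a 2-D list into k x k cells (assumes H and W divisible by k)."""
    h, w = len(grid), len(grid[0])
    bh, bw = h // k, w // k
    return [
        [
            sum(grid[y][x]
                for y in range(i * bh, (i + 1) * bh)
                for x in range(j * bw, (j + 1) * bw)) / (bh * bw)
            for j in range(k)
        ]
        for i in range(k)
    ]

def consistency_loss(pred, target, sizes=(1, 2, 4), weights=None):
    """Weighted L1 gap between pooled density maps at several grid sizes."""
    weights = weights or [1.0] * len(sizes)  # assumed free parameters
    total = 0.0
    for k, wgt in zip(sizes, weights):
        p, t = avg_pool_to(pred, k), avg_pool_to(target, k)
        l1 = sum(abs(a - b) for rp, rt in zip(p, t) for a, b in zip(rp, rt))
        total += wgt * l1 / (k * k)  # normalize by the number of cells
    return total

ones = [[1.0] * 4 for _ in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
print(consistency_loss(ones, zeros))  # 3.0
```

In this sketch the `weights` argument plays the role of the loss weighting coefficients listed above, so any tuning of them would show up directly in the total.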


