Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport
Pith reviewed 2026-05-20 11:11 UTC · model grok-4.3
The pith
Parameter-free attention mechanisms let CSRNet match or exceed the accuracy of parameterized versions for crowd counting while adding no extra model parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters.
What carries the argument
Parameter-free attention modules inserted into CSRNet to enhance its representational power for density map estimation without increasing the parameter count: channel-wise PFCA, spatial-wise SA, 3-D SimAM, and PFCASA, which combines PFCA and SA.
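As a rough illustration of what "parameter-free" means, the sketch below applies a SimAM-style weighting to one feature channel, using the inverse-energy form ((U_j - mu)^2 + 2(sigma^2 + lambda)) / (4(sigma^2 + lambda)). It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the name `simam_channel`, the default `lam=1e-4`, the population variance, and the sigmoid gate (borrowed from the original SimAM formulation) are all assumptions made for illustration.

```python
import math

def simam_channel(u, lam=1e-4):
    """Apply a SimAM-style gate to one channel given as a 2-D list of floats.

    Hypothetical helper: nothing here is learned, only statistics of `u` are used.
    """
    values = [v for row in u for v in row]
    n = len(values)
    mu = sum(values) / n
    # Assumption: population variance (divide by n); other implementations may use n - 1.
    var = sum((v - mu) ** 2 for v in values) / n
    out = []
    for row in u:
        new_row = []
        for v in row:
            # Inverse energy: larger for activations that stand out from the channel mean.
            inv_energy = ((v - mu) ** 2 + 2 * (var + lam)) / (4 * (var + lam))
            gate = 1 / (1 + math.exp(-inv_energy))  # sigmoid gate, assumed
            new_row.append(v * gate)
        out.append(new_row)
    return out

# Toy 3x3 channel with one strong activation in the centre.
channel = [[0.0, 0.0, 0.0],
           [0.0, 10.0, 0.0],
           [0.0, 0.0, 0.0]]
for row in simam_channel(channel):
    print(row)
```

The gate depends only on the mean and variance of the channel itself, so the module contributes no weights to the model, although it does add arithmetic at inference time.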
If this is right
- Parameter count, and hence stored model size, stays identical to the original CSRNet.
- PFCASA delivers the best results in scenes containing fewer than 40 individuals.
- PFCA becomes more effective as crowd density rises above that level.
- The approach supports direct integration into resource-limited edge devices for real-time occupancy monitoring.
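The parameter claim can be made concrete with a back-of-the-envelope count. The sketch below compares the weights that a squeeze-and-excitation style block would add against the 1% budget mentioned in the abstract. The backbone size, channel width, and reduction ratio are placeholder values chosen for illustration, not figures from the paper, and the helper names are hypothetical.

```python
def se_block_params(channels, reduction=16, bias=True):
    """Learnable weights in a two-layer squeeze-and-excitation style bottleneck."""
    hidden = channels // reduction
    weights = channels * hidden + hidden * channels
    biases = (hidden + channels) if bias else 0
    return weights + biases

def relative_overhead(base_params, added_params):
    """Added parameters as a fraction of the backbone's parameter count."""
    return added_params / base_params

BASE_PARAMS = 16_000_000  # placeholder backbone size, not taken from the paper
BUDGET = 0.01             # the 1% cap on parameterized modules from the abstract

parameterized = relative_overhead(BASE_PARAMS, se_block_params(512))
parameter_free = relative_overhead(BASE_PARAMS, 0)

print(f"SE-style block: {parameterized:.4%} (within budget: {parameterized <= BUDGET})")
print(f"parameter-free: {parameter_free:.4%}")
```

A parameter-free module always reports zero overhead in this accounting, which is why the comparison in the paper caps the parameterized baselines rather than the parameter-free ones.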
Where Pith is reading between the lines
- Real-time passenger counting could run on inexpensive onboard processors without cloud offloading.
- The same parameter-free modules might transfer to other transport vision tasks such as queue length estimation.
- A follow-up study measuring accuracy on diverse vehicle camera data would test generalization beyond the benchmark.
Load-bearing premise
Performance measured on the ShanghaiTech benchmark will transfer to real public transport camera feeds that differ in lighting, camera angles, motion blur, and passenger behavior.
What would settle it
Apply the PFCASA-augmented CSRNet to video from actual onboard public transport cameras and check whether mean absolute error stays within the range reported on ShanghaiTech.
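The check described above could be scripted roughly as follows. This is a sketch under assumptions: the counts are made-up placeholders, `reported_mae` stands in for whatever value the paper reports, and the 10% tolerance is an arbitrary choice. Note that crowd-counting papers commonly label the root of the mean squared error as "MSE"; the function below is named `rmse` to avoid ambiguity.

```python
import math

def mae(gt_counts, pred_counts):
    """Mean absolute error between ground-truth and predicted head counts."""
    return sum(abs(p - g) for g, p in zip(gt_counts, pred_counts)) / len(gt_counts)

def rmse(gt_counts, pred_counts):
    """Root mean squared error (often reported as 'MSE' in crowd counting)."""
    squared = sum((p - g) ** 2 for g, p in zip(gt_counts, pred_counts))
    return math.sqrt(squared / len(gt_counts))

def within_reported_range(gt_counts, pred_counts, reported_mae, tolerance=0.1):
    """True if the observed MAE is no more than `tolerance` above the reported one."""
    return mae(gt_counts, pred_counts) <= reported_mae * (1 + tolerance)

# Placeholder data: per-frame counts from a hypothetical onboard camera.
gt = [10, 20, 30]
pred = [12, 18, 33]
print("MAE:", mae(gt, pred))
print("RMSE:", rmse(gt, pred))
print("Within range:", within_reported_range(gt, pred, reported_mae=2.5))
```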
Original abstract
Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes integrating parameter-free attention mechanisms (PFCA, SA, SimAM, and a novel PFCASA combination) into the CSRNet backbone for crowd counting and density estimation. Motivated by public transport occupancy monitoring, it evaluates these modules on the ShanghaiTech dataset against parameterized attention baselines constrained to add no more than 1% parameters, claiming comparable or superior accuracy without increasing model size. Density-specific analysis is presented, with PFCASA performing better below 40 individuals and PFCA in denser scenes.
Significance. If supported by concrete metrics, the parameter-free approach would be valuable for edge deployment in resource-constrained public transport cameras, avoiding the parameter overhead of conventional attention sub-networks. The work explicitly names strengths such as the PFCASA combination tailored to video streams and the density-thresholded performance breakdown, but the absence of numerical results limits assessment of whether these constitute a genuine advance over existing CSRNet variants.
major comments (2)
- [Abstract] The central claim that 'experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy' is unsupported by any quantitative metrics (MAE, MSE), baseline tables, or error bars. This is load-bearing because the evaluation on the public dataset is the sole evidence offered for the performance assertions.
- [Abstract] The density-specific findings (PFCASA superior for scenes with fewer than 40 individuals, PFCA for higher densities) reference thresholds whose selection criteria, statistical significance, and sensitivity analysis are not described, and no per-density error breakdown or cross-validation details are supplied. This weakens the applicability claim for variable public-transport loading.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key numerical comparison (e.g., MAE on ShanghaiTech Part A/B) to allow immediate assessment of the 'comparable or superior' statement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our results.
- Referee: [Abstract] The central claim that 'experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy' is unsupported by any quantitative metrics (MAE, MSE), baseline tables, or error bars. This is load-bearing because the evaluation on the public dataset is the sole evidence offered for the performance assertions.
  Authors: We agree that the abstract would be strengthened by the inclusion of concrete quantitative metrics. In the revised manuscript we will update the abstract to report the key MAE and MSE values achieved by the best parameter-free variants (including PFCASA) on ShanghaiTech Part A and Part B, together with the corresponding baseline CSRNet results and the parameterized attention comparisons. These numbers are already present in the experimental tables of the full manuscript; their addition to the abstract will make the central performance claim directly verifiable.
  Revision: yes
- Referee: [Abstract] The density-specific findings (PFCASA superior for scenes with fewer than 40 individuals, PFCA for higher densities) reference thresholds whose selection criteria, statistical significance, and sensitivity analysis are not described, and no per-density error breakdown or cross-validation details are supplied. This weakens the applicability claim for variable public-transport loading.
  Authors: We acknowledge that the abstract does not explain the rationale for the density threshold of 40 or supply supporting statistical details. We will revise the abstract to state that the threshold was chosen after examining the empirical distribution of crowd counts in the ShanghaiTech training set. In addition, we will expand the main text (Section 4) to include a per-density MAE/MSE breakdown, a brief sensitivity analysis around the chosen threshold, and any cross-validation steps performed. These additions will better substantiate the density-dependent performance claims and their relevance to variable public-transport occupancy.
  Revision: yes
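A per-density breakdown of the kind promised in this response might look like the sketch below. It is illustrative only: the helper name `mae_by_density`, the bucket labels, and the sample counts are hypothetical, and only the threshold of 40 comes from the abstract.

```python
def mae_by_density(gt_counts, pred_counts, threshold=40):
    """Return MAE for scenes below `threshold` people and for scenes at or above it."""
    errors = {"sparse": [], "dense": []}
    for gt, pred in zip(gt_counts, pred_counts):
        bucket = "sparse" if gt < threshold else "dense"
        errors[bucket].append(abs(pred - gt))
    # A bucket with no scenes reports None rather than dividing by zero.
    return {name: (sum(errs) / len(errs) if errs else None)
            for name, errs in errors.items()}

# Placeholder counts, not results from the paper.
gt = [10, 30, 50, 200]
pred = [12, 27, 55, 190]
print(mae_by_density(gt, pred))

# A crude sensitivity check: repeat the split for nearby thresholds.
for t in (30, 40, 50):
    print(t, mae_by_density(gt, pred, threshold=t))
```

Running the same split at several thresholds is one simple way to see whether a reported crossover between two modules depends on the exact cut-off.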
Circularity Check
No significant circularity in derivation chain
The paper is an empirical evaluation study that applies existing CSRNet backbone and standard parameter-free attention modules (PFCA, SA, SimAM, PFCASA) to the public ShanghaiTech dataset. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-defined quantities. The central claim rests on benchmark comparisons with an external dataset and a standard model, which constitutes independent evidence rather than internal circularity. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that would force the result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: CSRNet serves as a suitable backbone for density map estimation in congested scenes
invented entities (1)
- PFCASA: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules ... $V_j = \frac{(U_j - \mu)^2 + 2(\sigma^2 + \lambda)}{4(\sigma^2 + \lambda)}$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.