
arxiv: 2605.18349 · v1 · pith:J2WO64YAnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

Pith reviewed 2026-05-20 11:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords crowd counting · parameter-free attention · CSRNet · public transport · density estimation · occupancy monitoring · ShanghaiTech · attention mechanisms

The pith

Parameter-free attention mechanisms let CSRNet match or exceed the accuracy of parameterized versions for crowd counting while adding no extra model parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether parameter-free attention modules can improve the CSRNet backbone for estimating crowd density in public transport settings. It evaluates channel-wise PFCA, spatial SA, 3-D SimAM, and a new PFCASA combination on the ShanghaiTech dataset against attention modules that add up to 1 percent more parameters. The experiments show these zero-parameter additions reach comparable or better counting accuracy. The work matters because public transport vehicles need lightweight models that run on edge hardware to monitor occupancy from sparse to dense conditions. Performance varies by density, with PFCASA stronger below 40 people and PFCA stronger at higher densities.

Core claim

Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters.

What carries the argument

Parameter-free attention modules (PFCA channel-wise, SA spatial-wise, SimAM 3-D, and their PFCASA combination) inserted into CSRNet to enhance representational power for density map estimation without increasing parameter count.
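Of the modules named above, SimAM has a closed-form gate that is easy to illustrate. The following is a minimal pure-Python sketch based on the published SimAM formulation (the regulariser default `e_lambda = 1e-4` is assumed from the SimAM reference implementation). It is not the authors' code, the function names are illustrative, and the nested-list layout (channels x height x width) is chosen only to keep the example dependency-free.

```python
import math


def simam_channel(channel, e_lambda=1e-4):
    """Apply a SimAM-style gate to one channel given as a 2-D list (H x W)."""
    values = [v for row in channel for v in row]
    if len(values) < 2:
        raise ValueError("a channel needs at least two spatial positions")
    n = len(values) - 1                        # "other" positions in the channel
    mu = sum(values) / len(values)             # spatial mean of the channel
    sq_dev = [(v - mu) ** 2 for v in values]   # squared deviation per position
    var = sum(sq_dev) / n                      # channel variance estimate
    width = len(channel[0])
    out = []
    for i, row in enumerate(channel):
        out_row = []
        for j, v in enumerate(row):
            # Inverse energy: positions that deviate from the mean score higher.
            inv_energy = sq_dev[i * width + j] / (4.0 * (var + e_lambda)) + 0.5
            gate = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid
            out_row.append(v * gate)
        out.append(out_row)
    return out


def simam(feature_map, e_lambda=1e-4):
    """Apply the gate independently to every channel; no learnable weights."""
    return [simam_channel(c, e_lambda) for c in feature_map]


if __name__ == "__main__":
    fmap = [[[1.0, 1.0], [1.0, 5.0]]]  # one channel, 2 x 2, one outlier
    print(simam(fmap))
```

The gate depends only on per-channel statistics of the input, which is why a module of this kind adds no parameters to the backbone.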

If this is right

  • Model size and computational cost stay identical to the original CSRNet.
  • PFCASA delivers the best results in scenes containing fewer than 40 individuals.
  • PFCA becomes more effective as crowd density rises above that level.
  • The approach supports direct integration into resource-limited edge devices for real-time occupancy monitoring.
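To put the "no extra parameters" point next to the 1 percent budget allowed for the parameterized baselines, the arithmetic can be sketched as below. The figures are assumptions for illustration only: a CSRNet size of roughly 16.26 million parameters, a 512-channel feature map at the insertion point (as produced by a VGG-16 frontend), and a squeeze-and-excitation style block with reduction ratio 16 standing in for a generic parameterized attention module. The paper's actual baselines and counts may differ.

```python
# Assumed approximate CSRNet parameter count (illustrative, not from this page).
BACKBONE_PARAMS = 16_260_000


def se_block_params(channels, reduction=16, bias=True):
    """Parameters of a two-layer squeeze-and-excitation style gate."""
    hidden = channels // reduction
    weights = channels * hidden + hidden * channels  # C -> C/r and C/r -> C
    biases = (hidden + channels) if bias else 0
    return weights + biases


def overhead_percent(extra_params, base_params=BACKBONE_PARAMS):
    """Extra parameters expressed as a percentage of the backbone size."""
    return 100.0 * extra_params / base_params


if __name__ == "__main__":
    extra = se_block_params(512)
    print("parameterized gate:", extra, "params,",
          round(overhead_percent(extra), 3), "% overhead")
    print("parameter-free gate: 0 params,", overhead_percent(0), "% overhead")
```

Under these assumptions a single parameterized gate stays well inside the 1 percent budget, while a parameter-free gate contributes exactly zero learnable parameters; its runtime arithmetic is a separate question that this sketch does not measure.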

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time passenger counting could run on inexpensive onboard processors without cloud offloading.
  • The same parameter-free modules might transfer to other transport vision tasks such as queue length estimation.
  • A follow-up study measuring accuracy on diverse vehicle camera data would test generalization beyond the benchmark.

Load-bearing premise

Performance measured on the ShanghaiTech benchmark will transfer to real public transport camera feeds that differ in lighting, camera angles, motion blur, and passenger behavior.

What would settle it

Apply the PFCASA-augmented CSRNet to video from actual onboard public transport cameras and check whether mean absolute error stays within the range reported on ShanghaiTech.
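A check of this kind reduces to comparing per-image counts. The sketch below assumes the usual density-map convention, in which the predicted count is the sum over the density map, and computes mean absolute error together with root mean squared error (which crowd-counting papers often report under the name MSE). All function names and sample numbers are hypothetical.

```python
import math


def count_from_density(density_map):
    """Predicted head count: sum of all values in a 2-D density map."""
    return sum(sum(row) for row in density_map)


def mae(pred_counts, true_counts):
    """Mean absolute error over per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)


def rmse(pred_counts, true_counts):
    """Root mean squared error over per-image counts."""
    sq = sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts))
    return math.sqrt(sq / len(true_counts))


if __name__ == "__main__":
    pred = [10, 20, 33]   # hypothetical predicted counts
    true = [12, 20, 30]   # hypothetical ground-truth counts
    print("MAE:", mae(pred, true), "RMSE:", rmse(pred, true))
```

Running the same functions on benchmark images and on onboard footage would give directly comparable error figures.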

Figures

Figures reproduced from arXiv: 2605.18349 by Aida Rostamza, Cristina Olaverri-Monreal, Enrico Del Re, Joshua Cherian Varughese.

Figure 1: Figure shows the setup used for investigating the impact of integrating attention modules between the frontend (VGG-16) and the ...
Figure 2: Histogram showing the distribution of the number of people ...
Figure 3: Figures (a) and (b) show the accuracy comparison at different ...
Figure 4: Model performance in terms of accuracy for crowd densities ...
Original abstract

Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes integrating parameter-free attention mechanisms (PFCA, SA, SimAM, and a novel PFCASA combination) into the CSRNet backbone for crowd counting and density estimation. Motivated by public transport occupancy monitoring, it evaluates these modules on the ShanghaiTech dataset against parameterized attention baselines constrained to add no more than 1% additional parameters, and claims comparable or superior accuracy without increasing model size. A density-specific analysis is presented, with PFCASA performing better in scenes with fewer than 40 individuals and PFCA performing better in denser scenes.

Significance. If supported by concrete metrics, the parameter-free approach would be valuable for edge deployment on resource-constrained public transport cameras, since it avoids the parameter overhead of conventional attention sub-networks. The manuscript's stated strengths include the PFCASA combination tailored to onboard video streams and the density-thresholded performance breakdown, but the absence of numerical results in the abstract limits assessment of whether these constitute a genuine advance over existing CSRNet variants.

major comments (2)
  1. [Abstract] The central claim that 'experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy' is unsupported by any quantitative metrics (MAE, MSE), baseline tables, or error bars. This is load-bearing because the evaluation on the public dataset is the sole evidence offered for the performance assertions.
  2. [Abstract] The density-specific findings (PFCASA superior for scenes with fewer than 40 individuals, PFCA for higher densities) reference thresholds whose selection criteria, statistical significance, or sensitivity analysis are not described, and no per-density error breakdown or cross-validation details are supplied. This weakens the applicability claim for variable public-transport loading.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key numerical comparison (e.g., MAE on ShanghaiTech Part A/B) to allow immediate assessment of the 'comparable or superior' statement.
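The per-density breakdown and threshold sensitivity requested in the major comments can be sketched in a few lines. This is an illustrative outline only: the threshold of 40 comes from the abstract, while the function names, bucket labels, and sample counts are hypothetical and do not reflect the authors' evaluation code.

```python
def mae_by_density(pred_counts, true_counts, threshold=40):
    """Split images by ground-truth count and report MAE for each bucket."""
    buckets = {"below": [], "at_or_above": []}
    for p, t in zip(pred_counts, true_counts):
        key = "below" if t < threshold else "at_or_above"
        buckets[key].append(abs(p - t))
    # An empty bucket yields None rather than a misleading zero.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


def threshold_sweep(pred_counts, true_counts, thresholds):
    """Repeat the breakdown for several thresholds as a simple sensitivity check."""
    return {th: mae_by_density(pred_counts, true_counts, th) for th in thresholds}


if __name__ == "__main__":
    pred = [10, 38, 45, 120]   # hypothetical predicted counts
    true = [12, 35, 50, 100]   # hypothetical ground-truth counts
    for th, result in threshold_sweep(pred, true, [30, 40, 50]).items():
        print(th, result)
```

If the ranking of two modules flips as the threshold moves by a few people, the "fewer than 40" boundary is fragile; if it holds across the sweep, the density-dependent claim is better supported.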

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our results.

  1. Referee: [Abstract] The central claim that 'experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy' is unsupported by any quantitative metrics (MAE, MSE), baseline tables, or error bars. This is load-bearing because the evaluation on the public dataset is the sole evidence offered for the performance assertions.

    Authors: We agree that the abstract would be strengthened by the inclusion of concrete quantitative metrics. In the revised manuscript we will update the abstract to report the key MAE and MSE values achieved by the best parameter-free variants (including PFCASA) on ShanghaiTech Part A and Part B, together with the corresponding baseline CSRNet results and the parameterized attention comparisons. These numbers are already present in the experimental tables of the full manuscript; their addition to the abstract will make the central performance claim directly verifiable. revision: yes

  2. Referee: [Abstract] The density-specific findings (PFCASA superior for scenes with fewer than 40 individuals, PFCA for higher densities) reference thresholds whose selection criteria, statistical significance, or sensitivity analysis are not described, and no per-density error breakdown or cross-validation details are supplied. This weakens the applicability claim for variable public-transport loading.

    Authors: We acknowledge that the abstract does not explain the rationale for the density threshold of 40 or supply supporting statistical details. We will revise the abstract to state that the threshold was chosen after examining the empirical distribution of crowd counts in the ShanghaiTech training set. In addition, we will expand the main text (Section 4) to include a per-density MAE/MSE breakdown, a brief sensitivity analysis around the chosen threshold, and any cross-validation steps performed. These additions will better substantiate the density-dependent performance claims and their relevance to variable public-transport occupancy. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper is an empirical evaluation study that applies existing CSRNet backbone and standard parameter-free attention modules (PFCA, SA, SimAM, PFCASA) to the public ShanghaiTech dataset. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-defined quantities. The central claim rests on benchmark comparisons with an external dataset and a standard model, which constitutes independent evidence rather than internal circularity. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that would force the result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper builds on the established CSRNet architecture and previously published parameter-free attention modules; the primary addition is their combination and application to public transport video streams.

axioms (1)
  • domain assumption: CSRNet serves as a suitable backbone for density map estimation in congested scenes
    The model is adopted without new justification or ablation in the abstract.
invented entities (1)
  • PFCASA (no independent evidence)
    purpose: A custom combination of PFCA and SA attention to leverage strengths in low-density public transport scenes
    Newly proposed in this work with no independent evidence outside the reported experiments.



