Locality-constrained Spatial Transformer Network for Video Crowd Counting

Biyun Zhan; Bo Hu; Shenghua Gao; Wandi Cai; Yanyan Fang

arxiv: 1907.07911 · v1 · pith:6AWFLRZWnew · submitted 2019-07-18 · 💻 cs.CV

Yanyan Fang, Biyun Zhan, Wandi Cai, Shenghua Gao, Bo Hu

Pith reviewed 2026-05-24 19:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords video crowd counting · spatial transformer · density map · locality constraint · temporal relation · crowd dataset

The pith

A locality-constrained spatial transformer estimates the next frame's density map from the current one to handle motion in video crowd counting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video crowd counting must deal with density map changes between neighboring frames caused by translation, rotation, scaling, and by people entering, exiting, or becoming occluded. The paper proposes LSTN, which first runs a CNN on each frame to produce a density map, then feeds that map into a Locality-constrained Spatial Transformer module to predict the following frame's map. This step creates a temporal link that compensates for the observed changes. The authors also release a dataset of 15K frames containing roughly 394K annotated heads from 13 scenes. Experiments on this collection and on prior datasets indicate the combined pipeline yields more accurate counts than frame-independent methods.

Core claim

LSTN generates per-frame density maps with a CNN and then applies a locality-constrained spatial transformer module that transforms the current density map to approximate the density map of the next frame, thereby relating neighboring maps to accommodate both geometric changes and variations in head count.

What carries the argument

The Locality-constrained Spatial Transformer (LST) module, which takes the current frame's density map and produces an estimate of the next frame's map under locality constraints.
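As a rough illustration of what transforming the current density map can mean, the sketch below warps a toy density map with an affine matrix and bilinear sampling, in the style of a generic spatial transformer. It is not the paper's LST module: the locality constraint is not specified in the text available here, and the names `bilinear_sample` and `affine_warp`, as well as the pixel-coordinate parameterisation of `theta`, are assumptions made for this example.

```python
import math


def bilinear_sample(dmap, y, x):
    """Sample dmap at a real-valued (y, x); positions outside the map read as 0."""
    h, w = len(dmap), len(dmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer neighbours with bilinear weights.
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * dmap[yy][xx]
    return value


def affine_warp(dmap, theta):
    """Warp dmap with a 2x3 affine matrix (hypothetical parameterisation).

    theta maps each OUTPUT pixel (x, y) to the SOURCE location it reads from:
        x_src = a*x + b*y + tx,  y_src = c*x + d*y + ty
    """
    (a, b, tx), (c, d, ty) = theta
    h, w = len(dmap), len(dmap[0])
    return [[bilinear_sample(dmap, c * x + d * y + ty, a * x + b * y + tx)
             for x in range(w)] for y in range(h)]


# Toy 5x5 density map with a single head at (row 2, col 1).
current = [[0.0] * 5 for _ in range(5)]
current[2][1] = 1.0

# Output pixel x reads from x - 1, so the content moves one pixel to the right.
shift_right = ((1.0, 0.0, -1.0), (0.0, 1.0, 0.0))
predicted = affine_warp(current, shift_right)
```

In this toy case the single unit of density moves from column 1 to column 2 and the warped map still sums to 1.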

If this is right

Density maps can be propagated forward in time with spatial adjustments to maintain consistency despite crowd movement.
Entry, exit, and occlusion effects become addressable through the temporal relation between consecutive maps.
A single large video dataset with 394K annotations supplies a concrete testbed for measuring such temporal corrections.
The same architecture demonstrates gains on existing crowd counting collections beyond the newly collected scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The LST module could be inserted into other video density tasks such as traffic flow or cell population tracking where local geometric shifts dominate.
Longer sequences might benefit from chaining multiple LST steps or combining them with recurrent connections to capture extended motion.
The released dataset invites direct comparisons of transformer-based temporal links against optical-flow or recurrent alternatives on the same 13 scenes.

Load-bearing premise

Changes in head density maps between neighboring frames can be modeled and corrected as locality-constrained spatial transformations applied to the current map.

What would settle it

A side-by-side test on the 15K-frame dataset that shows no reduction in counting error when the LST module is removed and each frame is processed independently by the CNN alone.

Original abstract

Compared with single image based crowd counting, video provides the spatial-temporal information of the crowd that would help improve the robustness of crowd counting. But translation, rotation and scaling of people lead to the change of density map of heads between neighbouring frames. Meanwhile, people walking in/out or being occluded in dynamic scenes leads to the change of head counts. To alleviate these issues in video crowd counting, a Locality-constrained Spatial Transformer Network (LSTN) is proposed. Specifically, we first leverage a Convolutional Neural Networks to estimate the density map for each frame. Then to relate the density maps between neighbouring frames, a Locality-constrained Spatial Transformer (LST) module is introduced to estimate the density map of next frame with that of current frame. To facilitate the performance evaluation, a large-scale video crowd counting dataset is collected, which contains 15K frames with about 394K annotated heads captured from 13 different scenes. As far as we know, it is the largest video crowd counting dataset. Extensive experiments on our dataset and other crowd counting datasets validate the effectiveness of our LSTN for crowd counting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LSTN adds a locality-constrained spatial transformer to propagate per-frame density maps and releases a claimed largest video crowd dataset, but the abstract leaves unclear how the module handles count changes from entry/exit/occlusion.

The paper's core new pieces are the LST module, which takes a CNN density map from the current frame and produces one for the next via a locality-constrained spatial transformer, plus the release of a 15K-frame video dataset with 394K annotated heads across 13 scenes. The dataset claim stands out as the largest of its kind mentioned, and the temporal link between frames is a direct response to limitations in single-image counting methods. That part is useful for anyone already working on density estimation who wants to add frame-to-frame consistency without starting from scratch.

The approach identifies the right issues: geometric shifts plus count changes from movement and occlusion. Releasing the data is a concrete step that others can use for benchmarking.

The soft spot is exactly the stress-test point. A spatial transformer applies a warp to the existing density field, but entry, exit, and occlusion alter total integrated count in ways that are not pure geometric transformations. The abstract describes the module as estimating the next map from the current one but gives no equation, diagram, or auxiliary pathway for injecting or removing density mass. Without that detail or any reported numbers, it is hard to judge whether the claim holds in dynamic scenes. No ablation or error analysis appears in the provided text either.

This is for CV researchers focused on crowd counting who need video data or temporal extensions. A reader building on density maps might extract the dataset or the module idea, though the lack of implementation specifics limits immediate use. It deserves peer review because the dataset size and the temporal framing address a real gap, even if the mechanism needs clarification and the experiments need scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Locality-constrained Spatial Transformer Network (LSTN) for video crowd counting. A CNN first produces per-frame density maps; a Locality-constrained Spatial Transformer (LST) module then warps the current-frame density map to produce an estimate of the next frame's map, with the goal of handling translation/rotation/scaling as well as entry/exit and occlusion effects. The authors release a new video dataset (15K frames, ~394K heads, 13 scenes) and report that experiments on this and prior datasets demonstrate the effectiveness of LSTN.

Significance. If the LST module can be shown to correctly propagate density under non-rigid motion while also accounting for count changes, the approach would supply a lightweight temporal link between consecutive density maps without requiring explicit tracking or optical flow. The release of a large, multi-scene video counting dataset is a concrete contribution that future work can use for benchmarking.

major comments (2)

[Abstract] Abstract: the LST module is described solely as estimating the next density map from the current one via a locality-constrained spatial transformer. Because a spatial transformer realizes a geometric warp, any change in integrated density (entry, exit, or occlusion) must be synthesized by the warp itself or by an auxiliary pathway; no equation, diagram, or loss term is supplied that would permit net mass creation or destruction, which directly undermines the claim that the module alleviates entry/exit/occlusion.
[Abstract] Abstract (and implied method section): the central empirical claim rests on the assertion that the LST corrects for all listed sources of density-map change, yet the provided description contains neither the precise formulation of the locality constraint nor any ablation that isolates the contribution of the LST versus a plain CNN density estimator.
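The mass-conservation concern in the first major comment can be made concrete with a toy case. The sketch below uses only an integer translation with zero-filled borders, which is a deliberate simplification and not the paper's LST module; `shift_columns` and `count` are hypothetical helper names. Under that assumption, a shift can drop density at the border (an exit) but cannot introduce density that was not already in the map (an entry).

```python
def shift_columns(dmap, dx):
    """Translate a density map dx pixels to the right, zero-filling the border."""
    w = len(dmap[0])
    return [[row[x - dx] if 0 <= x - dx < w else 0.0 for x in range(w)]
            for row in dmap]


def count(dmap):
    """Integrated density, i.e. the head count implied by the map."""
    return sum(sum(row) for row in dmap)


dmap = [[0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],   # two heads, one right next to the border
        [0.0, 0.0, 0.0, 0.0]]

inside = shift_columns(dmap, -1)  # both heads stay in view
exited = shift_columns(dmap, 1)   # the right-hand head leaves the frame

print(count(dmap), count(inside), count(exited))
```

A person entering the scene would need density to appear where the current map holds none, which is the case the referee asks the authors to address explicitly.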

minor comments (2)

[Abstract] The abstract states that the new dataset is 'the largest' but supplies no comparison table of existing video counting datasets (frame count, annotation density, scene diversity).
[Abstract] No implementation details (backbone CNN, training schedule, loss weights, or inference procedure for combining the warped and observed maps) are given, making reproduction impossible from the current text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

Referee: [Abstract] Abstract: the LST module is described solely as estimating the next density map from the current one via a locality-constrained spatial transformer. Because a spatial transformer realizes a geometric warp, any change in integrated density (entry, exit, or occlusion) must be synthesized by the warp itself or by an auxiliary pathway; no equation, diagram, or loss term is supplied that would permit net mass creation or destruction, which directly undermines the claim that the module alleviates entry/exit/occlusion.

Authors: We agree that the abstract provides no equation, diagram, or loss term permitting net mass creation or destruction, and that a pure geometric warp cannot synthesize count changes from entry/exit/occlusion. The manuscript description does not supply an auxiliary pathway or explicit mechanism for these effects. We will revise the method section to clarify the integration of the per-frame CNN with the LST module and add discussion of how count changes are handled in practice.

revision: yes

Referee: [Abstract] Abstract (and implied method section): the central empirical claim rests on the assertion that the LST corrects for all listed sources of density-map change, yet the provided description contains neither the precise formulation of the locality constraint nor any ablation that isolates the contribution of the LST versus a plain CNN density estimator.

Authors: We acknowledge that neither the abstract nor the implied method section supplies the precise mathematical formulation of the locality constraint or an ablation isolating the LST contribution. We will add the explicit formulation of the locality constraint to the method section and include a new ablation study comparing the full LSTN against a plain CNN baseline in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

Empirical NN architecture proposal with no derivation chain

The paper proposes an LSTN architecture: a CNN estimates per-frame density maps, followed by an LST module that warps the current density map to estimate the next frame's map. No equations, uniqueness theorems, or first-principles derivations are presented that reduce to fitted parameters or self-citations by construction. Validation relies on training and testing on external video datasets (including a newly collected one), which are independent of any internal fit. This matches the default case of an empirical proposal whose central claims do not collapse to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that single-frame CNN density estimation is reliable and that local spatial transformations suffice to model temporal evolution of crowd density; the LST module itself is an invented component without independent evidence supplied in the abstract.

axioms (1)

domain assumption: Convolutional neural networks can reliably estimate crowd density maps from individual frames
The paper begins by leveraging CNNs for per-frame density estimation before applying the LST module.
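For readers unfamiliar with density maps, the sketch below shows the property this assumption leans on: the map integrates to the head count. It builds a toy map by placing one normalised Gaussian per annotated head, a common convention in the crowd counting literature. Whether the paper generates its ground truth this way is not stated in the text available here, so the function name `gaussian_density_map`, the value of `sigma`, and the head coordinates are all assumptions for illustration.

```python
import math


def gaussian_density_map(heads, height, width, sigma=1.0):
    """Place one Gaussian per head, normalised so each head contributes exactly 1."""
    dmap = [[0.0] * width for _ in range(height)]
    for hy, hx in heads:
        kernel = [[math.exp(-((y - hy) ** 2 + (x - hx) ** 2) / (2 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        # Normalise inside the frame so truncation at the border loses no mass.
        norm = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / norm
    return dmap


heads = [(2, 3), (5, 5), (6, 1)]          # hypothetical (row, col) annotations
dmap = gaussian_density_map(heads, 8, 8)
estimated_count = sum(map(sum, dmap))     # sums to the number of heads
```

A CNN density estimator is trained to regress a map of this kind from the frame, and the predicted count is read off as the sum over all pixels.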

invented entities (1)

Locality-constrained Spatial Transformer (LST) module · no independent evidence
purpose: To estimate the next frame's density map from the current frame by constraining spatial transformations to local changes
This is a newly proposed component introduced to address temporal inconsistencies in video density maps.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

a Locality-constrained Spatial Transformer (LST) module is introduced to estimate the density map of next frame with that of current frame... $S(I_t(i,j), I_{t+1}(i,j)) = \exp\left(-\frac{\lVert I_t(i,j) - I_{t+1}(i,j) \rVert_2^2}{2\beta^2}\right)$
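The similarity function quoted in this passage is a Gaussian kernel on the squared Euclidean distance between two inputs. The sketch below evaluates it for two flattened blocks. It assumes that I_t(i,j) denotes the block at position (i,j) of frame t, which the quoted passage does not define, and the function name `block_similarity` and the value of `beta` are likewise illustrative.

```python
import math


def block_similarity(block_t, block_t1, beta):
    """S = exp(-||a - b||_2^2 / (2 * beta^2)) for two flattened blocks."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(block_t, block_t1))
    return math.exp(-sq_dist / (2.0 * beta ** 2))


same = block_similarity([0.2, 0.4], [0.2, 0.4], beta=1.0)      # identical blocks
changed = block_similarity([0.0, 0.0], [3.0, 4.0], beta=5.0)   # squared distance 25
```

Identical blocks give S = 1, and S decays towards 0 as the blocks diverge, with `beta` controlling how quickly.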
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

we first leverage a Convolutional Neural Networks to estimate the density map for each frame

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

[1]

Locality-constrained Spatial Transformer Network for Video Crowd Counting

INTRODUCTION Crowd counting has been widely used in computer vision because of its potential applications in video surveillance, traffic control, and emergency management. However, most previous works [1][2][3] focus on single image based crowd counting. In real applications, we have videos at hand, and usually the movement of crowd is predictable an...

work page internal anchor Pith review Pith/arXiv arXiv 1907
[2]

Crowd counting for single image

RELATED WORK Since our work is related to deep learning based crowd counting, here we only briefly discuss recent works on deep learning based crowd counting. Crowd counting for single image. Recent works [3][9][10] have shown the effectiveness of CNN for density map estimation in single image crowd counting. To improve the robustness of crowd countin...

work page
[3]

OUR APPROACH Our network architecture is shown in Fig. 1. It consists of two modules: density map regression module and Locality-constrained Spatial Transformer (LST) module. The density map regression module takes each frame as input and estimates its corresponding density map, and then the LST module takes the estimated density map as input to pred...

work page 2000
[4]

EXPERIMENTS 4.1. Evaluation metric Following work [19], we adopt both the mean absolute error (MAE) and the mean squared error (MSE) as metrics to evaluate the performance of different methods, which are defined as follows:
$$\mathrm{MAE} = \frac{1}{T}\sum_{i=1}^{T} \lvert z_i - \hat{z}_i \rvert, \qquad \mathrm{MSE} = \sqrt{\frac{1}{T}\sum_{i=1}^{T} (z_i - \hat{z}_i)^2} \tag{9}$$
where T is the total number of frames of all testing video sequences...

work page
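The two metrics in the excerpt above can be sketched in a few lines. The excerpt is truncated before it defines z_i and ẑ_i, so the code assumes they are the ground-truth and estimated head counts of frame i. The counts below are made-up toy numbers, not results from the paper. Note that the quantity the excerpt calls MSE includes a square root, so it behaves like a root-mean-squared error.

```python
import math


def mae(true_counts, pred_counts):
    """Mean absolute error over T frames."""
    T = len(true_counts)
    return sum(abs(z - zh) for z, zh in zip(true_counts, pred_counts)) / T


def mse(true_counts, pred_counts):
    """'MSE' as written in the excerpt: square root of the mean squared error."""
    T = len(true_counts)
    return math.sqrt(sum((z - zh) ** 2 for z, zh in zip(true_counts, pred_counts)) / T)


# Hypothetical per-frame counts, for illustration only.
true_counts = [10, 20, 30, 40]
pred_counts = [12, 18, 30, 44]

print(mae(true_counts, pred_counts))   # absolute errors 2, 2, 0, 4 -> 2.0
print(mse(true_counts, pred_counts))   # sqrt((4 + 4 + 0 + 16) / 4) = sqrt(6)
```

Because the squared term weights large misses more heavily, the second metric is the more sensitive of the two to occasional badly miscounted frames.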
[5]

We also report the performance of our method without LST

which achieves state-of-the-art performance for single image crowd counting, ConvLSTM [8] which is state-of-the-art video crowd counting method. We also report the performance of our method without LST. All results are shown in Table 2. We can see that our method achieves the best performance. Further the improvement of our method compared with the...

work page 2000
[6]

Specifically, we first leverage a density map regression module to estimate the density map of each frame

CONCLUSION In this paper, a Locality-constrained Spatial Transformer Network (LSTN) is proposed to explicitly relate the density maps of neighbouring frames for video crowd counting. Specifically, we first leverage a density map regression module to estimate the density map of each frame. Considering that people may walk in/out or are occluded, we div...

work page
[7]

Fast crowd density estimation with convolutional neural networks,

M. Fu, P. Xu, X. Li, Q. Liu, M. Ye, and C. Zhu, “Fast crowd density estimation with convolutional neural networks,” Engineering Applications of Artificial Intelligence, pp. 81–88, 2015

work page 2015
[8]

Cross-scene crowd counting via deep convolutional neural networks,

Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in CVPR, June 2015

work page 2015
[9]

Single-image crowd counting via multi-column convolutional neural network,

Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in CVPR, June 2016, pp. 589–597

work page 2016
[10]

Context-aware trajectory prediction,

F. Bartoli, G. Lisanti, L. Ballan, and A. Del Bimbo, “Context-aware trajectory prediction,” International Conference on Pattern Recognition, 2017

work page 2017
[11]

Histograms of oriented gradients for human detection,

N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” pp. 886–893, 2005

work page 2005
[12]

Pedestrian detection via classification on Riemannian manifolds,

Oncel Tuzel, Fatih Porikli, and Peter Meer, “Pedestrian detection via classification on Riemannian manifolds,” TPAMI, vol. 30, no. 10, pp. 1713–1727, 2008

work page 2008
[13]

Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras,

S. Zhang, G. Wu, J. P. Costeira, and J. M. F. Moura, “Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras,” in ICCV, Oct 2017, pp. 3687–3696

work page 2017
[14]

Spatiotemporal modeling for crowd counting in videos,

X. Feng, X. Shi, and D. Yeung, “Spatiotemporal modeling for crowd counting in videos,” in ICCV. IEEE, 2017, pp. 5161–5169

work page 2017
[15]

Switching convolutional neural network for crowd counting,

Deepak Babu Sam, Shiv Surya, and R. Venkatesh Babu, “Switching convolutional neural network for crowd counting,” in CVPR, July 2017

work page 2017
[16]

Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes,

Y. Li, X. Zhang, and D. Chen, “Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes,” in CVPR, 2018, pp. 1091–1100

work page 2018
[17]

Towards perspective-free object counting with deep learning,

Daniel D. Onoro-Rubio and R. López-Sastre, “Towards perspective-free object counting with deep learning,” in ECCV. Springer, 2016, pp. 615–629

work page 2016
[18]

Decidenet: counting varying density crowds through attention guided detection and density estimation,

J. Liu, C. Gao, D. Meng, and A. Hauptmann, “Decidenet: counting varying density crowds through attention guided detection and density estimation,” in CVPR, 2018, pp. 5197–5206

work page 2018
[19]

Composition loss for counting, density map estimation and localization in dense crowds,

H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, “Composition loss for counting, density map estimation and localization in dense crowds,” arXiv: Computer Vision and Pattern Recognition, 2018

work page 2018
[20]

Spatial transformer networks,

Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025

work page 2015
[21]

Supervised transformer network for efficient face detection,

Dong Chen, Gang Hua, Fang Wen, and Jian Sun, “Supervised transformer network for efficient face detection,” in ECCV. Springer, 2016, pp. 122–138

work page 2016
[22]

Toward end-to-end face recognition through alignment learning,

Yuanyi Zhong, Jiansheng Chen, and Bo Huang, “Toward end-to-end face recognition through alignment learning,” IEEE Signal Processing Letters, vol. 24, no. 8, pp. 1213–1217, 2017

work page 2017
[23]

Recursive spatial transformer (rest) for alignment-free face recognition,

Wanglong Wu, Meina Kan, Xin Liu, Yi Yang, Shiguang Shan, and Xilin Chen, “Recursive spatial transformer (rest) for alignment-free face recognition,” in CVPR, 2017, pp. 3772–3780

work page 2017
[24]

Crowd Counting using Deep Recurrent Spatial-Aware Network

Lingbo Liu, Hongjun Wang, Guanbin Li, Wanli Ouyang, and Liang Lin, “Crowd counting using deep recurrent spatial-aware network,” arXiv preprint arXiv:1807.00601, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Counting in dense crowds using deep features,

Karunya Tota and Haroon Idrees, “Counting in dense crowds using deep features,” 2015

work page 2015
[26]

Privacy preserving crowd monitoring: Counting people without people models or tracking,

A. B. Chan, Zhang-Sheng John Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Counting people without people models or tracking,” in CVPR, June 2008, pp. 1–7

work page 2008
[27]

Face recognition using kernel ridge regression,

S. An, W. Liu, and S. Venkatesh, “Face recognition using kernel ridge regression,” in CVPR, June 2007, pp. 1–7

work page 2007
[28]

Feature mining for localised crowd counting,

Ke Chen, Chen Change Loy, Shaogang Gong, and Tao Xiang, “Feature mining for localised crowd counting,” in BMVC

work page
[29]

Cumulative attribute space for age and crowd density estimation,

K. Chen, S. Gong, T. Xiang, and C. C. Loy, “Cumulative attribute space for age and crowd density estimation,” in CVPR, June 2013, pp. 2467–2474

work page 2013
[30]

Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation,

V. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada, “Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation,” in ICCV, Dec 2015, pp. 3253–3261

work page 2015
