arxiv: 1906.10313 · v3 · pith:MWTXAA6Onew · submitted 2019-06-25 · 💻 cs.RO

DensePeds: Pedestrian Tracking in Dense Crowds Using Front-RVO and Sparse Features

Pith reviewed 2026-05-25 16:56 UTC · model grok-4.3

classification 💻 cs.RO
keywords pedestrian tracking · dense crowds · Front-RVO · motion model · sparse features · collision avoidance · Mask R-CNN

The pith

DensePeds tracks pedestrians in dense crowds using Front-RVO motion prediction and sparse features from Mask R-CNN.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DensePeds, a method for tracking individuals in crowds denser than two pedestrians per square meter, filmed from front-facing or elevated cameras. It introduces Front-RVO (FRVO), a motion model based on collision-avoidance constraints, and pairs it with sparse feature vectors computed by Mask R-CNN to limit track losses (false negatives). On the standard MOT benchmark the authors report a 4.5× speed-up over prior tracking algorithms, and on a new dense-crowd dataset they report an average gain of more than 2.6% on the absolute scale over the previous state of the art. The result matters if it makes reliable per-person tracking feasible in high-density settings where most prior systems fragment. The central premise is that a tailored motion model plus efficient features can preserve track continuity without heavy computation.

Core claim

DensePeds uses Front-RVO, a motion model that applies collision-avoidance constraints to predict pedestrian movement in front-facing or elevated camera views, together with sparse feature vectors computed by Mask R-CNN, to reduce false negatives. The authors report that this combination runs 4.5 times faster than prior trackers on the MOT benchmark and improves on the state of the art in dense crowd videos by over 2.6% (absolute, on average).

What carries the argument

Front-RVO, a motion model encoding collision avoidance constraints for front-view pedestrian prediction, integrated with sparse feature vectors from Mask R-CNN to sustain track continuity.
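
As a rough illustration of what a collision-avoidance constraint checks, the sketch below tests whether two disc-shaped agents moving at constant velocity would touch within a time horizon. This is the basic velocity-obstacle question that RVO-style models build on. It is not the paper's Front-RVO formulation: the function name `will_collide`, the disc shape, and all numeric values are assumptions made for this example.

```python
import math

def will_collide(p_a, v_a, p_b, v_b, radius_sum, horizon):
    """Return True if two discs moving at constant velocity touch within `horizon` seconds.

    p_a, p_b: (x, y) positions in meters; v_a, v_b: (x, y) velocities in m/s.
    radius_sum: sum of the two disc radii. All names are illustrative.
    """
    # Work in the frame of agent B: relative position and velocity of A.
    px, py = p_a[0] - p_b[0], p_a[1] - p_b[1]
    vx, vy = v_a[0] - v_b[0], v_a[1] - v_b[1]

    # Solve |p + v*t|^2 = radius_sum^2 for the earliest contact time t.
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - radius_sum ** 2

    if c <= 0.0:
        return True   # the discs already overlap
    if a == 0.0:
        return False  # no relative motion, so the gap never closes
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return False  # the relative path misses the combined disc
    t_contact = (-b - math.sqrt(disc)) / (2.0 * a)
    return 0.0 <= t_contact <= horizon

# Two pedestrians walking toward each other, 2 m apart (hypothetical values).
print(will_collide((0, 0), (1, 0), (2, 0), (-1, 0), radius_sum=0.5, horizon=2.0))
```

A motion model in this family would typically steer a predicted velocity away from values for which such a test returns true.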

If this is right

  • Individual pedestrian tracks remain stable longer when density exceeds two people per square meter.
  • Processing runs fast enough for real-time operation on standard hardware.
  • Fewer track losses support downstream uses such as crowd counting or behavior analysis.
  • The method combines readily with existing detection networks without requiring new end-to-end training.
  • Gains hold across both standard MOT sequences and specialized dense-crowd test sets.
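
The density threshold in the first bullet is a simple ratio of head count to ground area. The sketch below shows that check; the helper names and the example numbers are hypothetical, and estimating the visible ground area from a front-facing camera is a separate problem that is not addressed here.

```python
DENSE_THRESHOLD = 2.0  # pedestrians per square meter, the figure quoted for DensePeds

def crowd_density(num_pedestrians, ground_area_m2):
    """Pedestrians per square meter of ground area."""
    if ground_area_m2 <= 0:
        raise ValueError("ground area must be positive")
    return num_pedestrians / ground_area_m2

def is_dense(num_pedestrians, ground_area_m2, threshold=DENSE_THRESHOLD):
    """True when the crowd exceeds the density threshold."""
    return crowd_density(num_pedestrians, ground_area_m2) > threshold

# Hypothetical scene: 30 people standing on 12 square meters.
print(crowd_density(30, 12.0), is_dense(30, 12.0))
```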

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The collision-avoidance formulation in Front-RVO could be adapted for side or overhead views if perspective effects are modeled.
  • Sparse features might reduce compute in other real-time vision pipelines that currently rely on dense descriptors.
  • Creation of the new dense dataset highlights the value of targeted benchmarks for extreme conditions.
  • If the speed advantage persists, hybrid motion-plus-feature designs may compete with heavier learned models in resource-constrained settings.

Load-bearing premise

The new dense crowd dataset and the evaluation protocol used to measure the accuracy gain accurately reflect real-world performance without post-hoc selection of test sequences or metrics.

What would settle it

An independent test on a separate collection of dense crowd videos that shows no accuracy or speed advantage over existing methods would disprove the performance claims.

Figures

Figures reproduced from arXiv: 1906.10313 by Aniket Bera, Dinesh Manocha, Rohan Chandra, Uttaran Bhattacharya.

Figure 1: Performance of our pedestrian tracking algorithm on the NPLACE-1 …
Figure 3: (left) A circular representation results in many false δ-overlaps. (right) FRVO efficiently models pedestrians using elliptical representations that on average cause δ → 0. Sequence: IITF-1. ∆ is upper-bounded by a function of δ. Observe that the error bound increases for higher δ. This is interpreted as follows: the more we increase the δ-overlap, the transparent gray cone will correspondingly shrink, and…
Figure 2: (Left) Standard VO configuration fundamental to RVO, using …
Figure 4: Overview of our real-time pedestrian tracking algorithm, DensePeds.
Figure 5: Qualitative analysis of DensePeds on the NPLACE-2 sequence consisting of 144 pedestrians in the video. Frames are chosen with a gap of 4 …
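
The contrast drawn in the Figure 3 caption between circular and elliptical pedestrian shapes can be illustrated with a simple overlap test. The sketch below is not the paper's δ-overlap definition, which this page does not reproduce. It assumes two identical, axis-aligned shapes, and the body dimensions are hypothetical. For identical aligned ellipses, scaling each axis by its semi-axis turns both shapes into unit circles, so they overlap exactly when the scaled centre distance is below 2.

```python
import math

def circles_overlap(c1, c2, radius):
    """True if two equal circles with the given radius overlap."""
    return math.hypot(c1[0] - c2[0], c1[1] - c2[1]) < 2.0 * radius

def aligned_ellipses_overlap(c1, c2, semi_x, semi_y):
    """True if two identical axis-aligned ellipses overlap.

    Dividing each axis by its semi-axis maps both ellipses to unit circles,
    which overlap when their centres are closer than 2.
    """
    dx = (c1[0] - c2[0]) / semi_x
    dy = (c1[1] - c2[1]) / semi_y
    return math.hypot(dx, dy) < 2.0

# Hypothetical pedestrians walking in single file, 0.30 m apart front to back.
front, back = (0.0, 0.0), (0.0, 0.30)
shoulder_half_width, body_half_depth = 0.25, 0.12

print(circles_overlap(front, back, shoulder_half_width))
print(aligned_ellipses_overlap(front, back, shoulder_half_width, body_half_depth))
```

With these numbers the circle test reports an overlap while the ellipse test does not, which is the kind of false overlap the caption attributes to circular representations.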
Original abstract

We present a pedestrian tracking algorithm, DensePeds, that tracks individuals in highly dense crowds (greater than 2 pedestrians per square meter). Our approach is designed for videos captured from front-facing or elevated cameras. We present a new motion model called Front-RVO (FRVO) for predicting pedestrian movements in dense situations using collision avoidance constraints and combine it with state-of-the-art Mask R-CNN to compute sparse feature vectors that reduce the loss of pedestrian tracks (false negatives). We evaluate DensePeds on the standard MOT benchmarks as well as a new dense crowd dataset. In practice, our approach is 4.5 times faster than prior tracking algorithms on the MOT benchmark and we are state-of-the-art in dense crowd videos by over 2.6% on the absolute scale on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents DensePeds, a pedestrian tracking algorithm for dense crowds (>2 pedestrians per m²) captured from front-facing or elevated cameras. It introduces the Front-RVO motion model incorporating collision avoidance and combines it with Mask R-CNN to generate sparse feature vectors that reduce track loss. The method is evaluated on standard MOT benchmarks and a newly introduced dense crowd dataset, claiming 4.5× faster runtime than prior trackers on MOT and state-of-the-art performance on dense videos by >2.6% on average.

Significance. If the performance claims hold under independent evaluation, the work could improve tracking reliability in high-density scenarios relevant to surveillance and autonomous systems. The introduction of a dedicated dense-crowd dataset is a potential contribution, though its value depends on transparent collection and split protocols. The reported speed-up on an established benchmark is a concrete, verifiable strength.

major comments (2)
  1. [Evaluation on new dense crowd dataset (abstract and results section)] The central SOTA claim of >2.6% absolute improvement on dense videos rests entirely on the newly introduced dataset. No details are provided on dataset collection procedure, sequence selection criteria, density statistics across sequences, or the train/test split protocol, making it impossible to determine whether the reported gain is independent of method-specific tuning or post-hoc sequence choice.
  2. [Method and results] The abstract states that Front-RVO is combined with Mask R-CNN sparse features to reduce false negatives, yet no ablation is referenced that isolates the contribution of each component on the dense dataset; without such controls the attribution of the 2.6% gain remains unclear.
minor comments (1)
  1. [Abstract] The abstract claims 'state-of-the-art in dense crowd videos by over 2.6% on the absolute scale on average' without naming the competing methods or the exact metric (MOTA, IDF1, etc.) used for the comparison.
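
To see why the choice of metric and the word "absolute" matter, the sketch below computes MOTA as defined in the CLEAR MOT metrics (one minus the ratio of misses, false positives and identity switches to ground-truth objects) and contrasts an absolute gain with a relative one. The counts are made-up placeholders, not results from the paper, and the paper may report a different metric.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy over a whole sequence (CLEAR MOT)."""
    if num_ground_truth <= 0:
        raise ValueError("need at least one ground-truth object")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth

def absolute_gain(new_score, old_score):
    """Difference in score, the sense of 'on the absolute scale'."""
    return new_score - old_score

def relative_gain(new_score, old_score):
    """Difference as a fraction of the baseline score."""
    return (new_score - old_score) / old_score

# Placeholder counts for a baseline tracker and an improved tracker.
baseline = mota(300, 100, 20, 1000)
improved = mota(280, 95, 19, 1000)

print(f"baseline MOTA: {baseline:.3f}")
print(f"improved MOTA: {improved:.3f}")
print(f"absolute gain: {absolute_gain(improved, baseline):.3f}")
print(f"relative gain: {relative_gain(improved, baseline):.3f}")
```

The same pair of scores yields different headline numbers depending on which gain is quoted, so naming the metric and the baseline is needed to interpret a figure such as 2.6%.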

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested details and analyses.

Point-by-point responses
  1. Referee: [Evaluation on new dense crowd dataset (abstract and results section)] The central SOTA claim of >2.6% absolute improvement on dense videos rests entirely on the newly introduced dataset. No details are provided on dataset collection procedure, sequence selection criteria, density statistics across sequences, or the train/test split protocol, making it impossible to determine whether the reported gain is independent of method-specific tuning or post-hoc sequence choice.

    Authors: We agree that the current manuscript lacks sufficient documentation on the new dense crowd dataset. In the revision we will add a dedicated subsection that describes the collection procedure, sequence selection criteria, per-sequence density statistics, and the train/test split protocol. This addition will allow readers to assess the independence of the reported gains. revision: yes

  2. Referee: [Method and results] The abstract states that Front-RVO is combined with Mask R-CNN sparse features to reduce false negatives, yet no ablation is referenced that isolates the contribution of each component on the dense dataset; without such controls the attribution of the 2.6% gain remains unclear.

    Authors: We acknowledge that the manuscript does not contain an ablation isolating Front-RVO from the Mask R-CNN sparse features on the dense dataset. We will perform the required experiments and include an ablation study in the revised results section that reports tracking metrics with and without each component. revision: yes
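
The ablation promised in the second response amounts to a two-by-two grid over the two components. The sketch below only enumerates that grid. The flag names `use_front_rvo` and `use_sparse_features` are hypothetical, and how each variant is run and scored is left to the tracker's own evaluation code.

```python
from itertools import product

def ablation_grid():
    """Enumerate every on/off combination of the two components under study."""
    return [
        {"use_front_rvo": rvo, "use_sparse_features": feats}
        for rvo, feats in product((False, True), repeat=2)
    ]

for config in ablation_grid():
    print(config)
```

Reporting the same tracking metric for all four rows would separate the contribution of the motion model from that of the features.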

Circularity Check

0 steps flagged

No circularity in performance claims or derivation

The paper describes an algorithmic pipeline (the Front-RVO motion model combined with Mask R-CNN sparse features) whose outputs are measured with standard tracking metrics on the established MOT benchmark and on a newly introduced dense-crowd dataset. No equations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-citations, or renamed inputs; the reported speed-up (4.5×) and accuracy gain (over 2.6%) are direct empirical measurements rather than tautological restatements of the method itself. The evaluation protocol therefore remains independent of the claimed results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the introduction of the Front-RVO model itself.

pith-pipeline@v0.9.0 · 5676 in / 1156 out tokens · 17093 ms · 2026-05-25T16:56:57.668959+00:00 · methodology

